Preface



International Federation of Classification Societies 

The International Federation of Classification Societies (IFCS) is an agency
for the dissemination of technical and scientific information concerning
classification and data analysis in the broad sense and in as wide a range of
applications as possible. It was founded in 1985 in Cambridge (UK) by the
following scientific societies and groups: British Classification Society - BCS;
Classification Society of North America - CSNA; Gesellschaft für Klassifikation -
GfKl; Japanese Classification Society - JCS; Classification Group of the Italian
Statistical Society - CGSIS; Société Francophone de Classification - SFC. The
IFCS now also includes the following societies: Dutch-Belgian Classification
Society - VOC; Polish Classification Section - SKAD; Portuguese Classification
Association - CLAD; Group-at-Large; Korean Classification Society - KCS.



Biannual Meeting of the Classification and Data Analysis 
Group of SIS 

The biannual meeting of the Classification and Data Analysis Group of the
Società Italiana di Statistica (SIS) was held in Pescara, July 3-4, 1997.

The 69 papers presented were divided into 17 sessions. Each session was
organized by a chairperson, with two invited speakers and two contributed papers
selected from a call for papers. All the papers were refereed. Furthermore, a
discussant was provided for each session during the meeting. A short version of
the papers (4 pages) was published before the conference.

The scientific program covered the following topics:

• Classification Theory

Fuzzy Methods - Hierarchical Classification - Non Hierarchical Classification -
Optimisation Approach in Classification - Classification of Multiway Data -
Probabilistic Methods for Clustering - Consensus and Comparison Theories in
Classification - Spatial Data and Clustering - Validity of Clustering - Neural
Networks and Classification - Genetic Algorithms - Classification with Constraints

• Multivariate Data Analysis

Categorical Data Analysis - Factor Analysis and Related Methods - Discrimination
and Classification - Visual Treatment in Data Analysis - Symbolic Data Analysis -
Non Linear Data Analysis

• Multiway Data Analysis

Analysis of Multiway Data - Panel Data Analysis

• Proximity Structure Analysis

Multidimensional Scaling - Similarities and Dissimilarities

• Software Developments for Classification and Data Analysis

Algorithms for Hierarchical and Non Hierarchical Classification - Computer Data
Visualization - Statistical Algorithms for Multivariate Analysis

• Applied Classification and Data Analysis in Social, Economic, Medical, and
other Sciences

Classification and Data Analysis of Textual Data - Data Analysis in Economics -
Classification and Discrimination Approaches in Medical Science

The present volume contains 45 refereed papers, presented in four chapters
as follows:

Classification 

• Methodologies in Classification 

• Fuzzy clustering and fuzzy methods 

Other Approaches for Classification 

• Discrimination and Classification 

• Regression Tree and Neural Networks 

Multivariate and Multidimensional Data Analysis 

• Proximity Methodologies in Classification 

• Factorial methods 

• Spatial Analysis 

• Multiway Data Analysis 

• Multivariate analysis 

Case Studies 
Measuring the Influence of Individual
Observations and Variables in Cluster
Analysis

Andrea Cerioli 

Istituto di Statistica, Università di Parma
Via Kennedy 6, 43100 Parma, Italy, email: Stated@ipruniv.cce.unipr.it

Abstract: In this paper we address some issues in the field of cluster stability.
In particular, we study the effect of deleting individual cases and variables on
the results of a (nonhierarchical) cluster analysis. We do not restrict ourselves
to the computation of a single influence measure for each data point, or
variable, but we analyze how individual influence varies when the number of
clusters changes. For this purpose we suggest the use of simple deletion
diagnostics computed by cross-validation. The suggested approach is applied to
real data and results are displayed by means of a simple tool of modern
multivariate data visualization. Furthermore, the performance of our diagnostics
is assessed through Monte Carlo simulations both under the null hypothesis of
well-behaved data and the alternative hypothesis of isolated contamination.

Keywords: cluster stability; deletion diagnostic; k-means; outlier; stalactite
plot.

1. Introduction 

In a hierarchy of problems, influence detection has become a major issue in the 
field of classification stability, which in turn falls within the broad framework 
of clustering validation (see Milligan, 1996, and Gordon, 1996, Section 7, for 
general reviews of these topics). Specifically, the purpose of influence detection 
is to identify those data points that have a large impact on the partitions obtained 
from a clustering method. Influence detection usually proceeds by comparing a 
reference partition, computed on the complete data set by means of a specified 
clustering algorithm, and a modified partition, obtained from a reduced data set 
using the same algorithm. The reduced data set is produced by simply deleting 
either a single case or a single variable from the complete one. So far, research
efforts have mainly focused on measuring the influence of individual observations
on results from hierarchical clustering (see, e.g., Cheng and Milligan, 1996, and
the references therein), although occasional interest has also arisen in the related
area of identifying influential variables (Gnanadesikan et al., 1977).

The purpose of this paper is to extend previous work in the field of influence
detection in a number of ways. Firstly, we analyze how single-case influence
varies in (nonhierarchical) clustering when the number of clusters changes.
This task is accomplished by computing cross-validation deletion diagnostics at
successive steps of the clustering process. The resulting pattern of influential
observations is then displayed by means of a stalactite plot (Atkinson, 1994).
Secondly, we adopt a similar approach to study the influence of individual
variables across a varying number of clusters. Finally, we perform a Monte Carlo
experiment to assess the behaviour of our deletion diagnostics both under the null
hypothesis of well-behaved data and the alternative hypothesis of isolated
contamination. In addition, we suggest an extension of the seed selection technique
adopted in the FASTCLUS procedure of SAS (1990), in order to overcome the
possible effect of the observation order on results from the k-means algorithm.

2. Deletion Statistics in Cluster Analysis 

Recent contributions to the identification of influential observations in cluster 
analysis include Jolliffe et al. (1995), and Cheng and Milligan (1996). Both pa- 
pers explicitly consider only hierarchical methods and, more importantly, quan- 
tify the effect of each data unit through a single number. However, we believe 
that observing individual influence at successive steps of the clustering process 
can lead to a better insight into the data. As our simulations show, this approach 
can also have beneficial consequences on the choice of the final number of clus- 
ters. 

Let $n$ be the total number of objects to be classified and let $P_k$ denote the
$k$-clusters reference partition. In our applications $P_k$ is obtained by clustering
the complete data set of $n$ elements. We do not address the related problem of
cluster recovery, where $P_k$ is the true, but usually unknown, underlying cluster
structure. Furthermore, let $P_k^i$ be the $k$-clusters partition which is computed
after deletion of object $i$ ($i = 1, \dots, n$). A measure of influence of
object $i$ on the $k$-clusters solution, say $\theta_k^i$, is then defined as a
disagreement measure between $P_k$ and $P_k^i$, provided that information about $i$
is removed from the reference partition. Any index for partition comparison
belonging to the class proposed by Hubert and Arabie (1985) can be used at this step.

Examination of the values $\theta_k^i$, $2 \le k \le n - 1$, provides the basis for
identifying the pattern of outliers or other highly influential observations across
different clustering solutions. Clearly, a detailed graphical presentation of all
such values becomes infeasible in many practical applications, where $n$ is
reasonably large. Therefore we choose to display only units for which at least one
$\theta_k^i$ is sufficiently "high". In principle, each $\theta_k^i$ could be judged
with respect to the distribution of the corresponding random variable under the null
hypothesis that $P_k$ and $P_k^i$ are independent partitions of the same set of
objects (Hubert and Arabie, 1985; Cerioli, 1997). However, such an extreme
hypothesis does not seem to be adequate in the present context, where $P_k$ and
$P_k^i$ are computed from data sets differing only by one observation. As it is
difficult to devise what alternative distribution might be relevant, we take a more
practical approach and standardize $\theta_k^i$ by cross-validation.

Let $\bar\theta_k = \sum_i \theta_k^i / n$ be the average disagreement measure at
the $k$-clusters level, and

$$\sigma_k^2 = \frac{\sum_i (\theta_k^i - \bar\theta_k)^2}{n}.$$

The cross-validated standardized value of $\theta_k^i$ is then defined as

$$z_k^i = \frac{\theta_k^i - \bar\theta_k}{\sigma_k} \quad \text{if } \sigma_k > 0, \qquad (1)$$

and $z_k^i = 0$ otherwise.

We regard as potentially influential on the $k$-clusters solution those objects for
which $z_k^i$ exceeds a fixed threshold $z^*$. A graphical display of potentially
influential data units is then produced through a stalactite plot (Atkinson, 1994).
In the examples that follow, we take $z^* = 2$ and $z^* = 3$ as useful cutoff points.
The adequacy of these thresholds is assessed through Monte Carlo simulations in
Section 3. Furthermore, a number of measures of cluster cohesion can be computed at
each step, as a supplement to $z_k^i$, to see whether any influential object is
either an inhibitor or a facilitator in the clustering process (Cheng and Milligan,
1996).
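
As an illustration, the standardization and thresholding of the disagreement
values at a given number of clusters can be sketched in Python. This is only a
minimal sketch: it assumes that standardization uses the mean and the divisor-n
standard deviation of the disagreement values over all units, and the function
names and input values are hypothetical, not part of the original procedure.

```python
import math


def standardized_influence(theta):
    """Standardize the disagreement values theta[i] for one value of k.

    Assumption: mean and divisor-n standard deviation over all units.
    Every unit gets 0 when the standard deviation is 0.
    """
    n = len(theta)
    mean = sum(theta) / n
    sd = math.sqrt(sum((t - mean) ** 2 for t in theta) / n)
    if sd == 0:
        return [0.0] * n
    return [(t - mean) / sd for t in theta]


def flag_units(z, cutoffs=(2.0, 3.0)):
    """For each cutoff, list the indexes of the units whose z exceeds it."""
    return {c: [i for i, zi in enumerate(z) if zi > c] for c in cutoffs}


# Hypothetical disagreement values: 19 units with no effect, one with a large one.
theta_k = [0.0] * 19 + [0.5]
z_k = standardized_influence(theta_k)
flags = flag_units(z_k)
```

Repeating the computation for each value of $k$ gives the rows of a stalactite
plot, with one symbol for every flagged unit.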

The approach outlined above is easily extended to the detection of influential
variables, in the spirit of the seminal paper by Gnanadesikan et al. (1977).
Let $p$ denote the total number of variables used in the clustering process. A
measure of influence of variable $j$ ($j = 1, \dots, p$) on the $k$-clusters solution
can be obtained by simply comparing $P_k$ and $P_k^j$, where $P_k^j$ now denotes the
partition computed after deletion of variable $j$. In this case, we also suggest
monitoring how distances from centroids change when passing from $P_k$ to $P_k^j$.
Let $d_{ik}$ denote the Mahalanobis distance of object $i$ from its cluster centroid
in the reference $k$-clusters partition, and let $d_{ik}^j$ be the corresponding
distance in $P_k^j$. Compute the average distances $\bar d_k = \sum_i d_{ik} / n$ and
$\bar d_k^j = \sum_i d_{ik}^j / n$. Then, for each $j = 1, \dots, p$, examination of
the quantities

$$\Delta_k^j = \bar d_k / p - \bar d_k^j / (p - 1), \qquad k = 2, \dots, n - 1, \qquad (2)$$

provides further information on the influence of variable $j$ at successive steps of
the classification procedure.
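
The quantity in (2) only requires two averages, each divided by the number of
variables on which it is based. A minimal Python sketch follows; the function
name is hypothetical and the Mahalanobis distances are assumed to have been
computed beforehand.

```python
def adjusted_distance_change(d_ref, d_del, p):
    """Change in average distance per variable, as in (2).

    d_ref: distances of each object from its centroid with all p variables.
    d_del: the same distances after deleting one variable (p - 1 variables).
    """
    if p < 2:
        raise ValueError("at least two variables are needed")
    if len(d_ref) != len(d_del):
        raise ValueError("both lists must refer to the same objects")
    n = len(d_ref)
    mean_ref = sum(d_ref) / n
    mean_del = sum(d_del) / n
    return mean_ref / p - mean_del / (p - 1)


# Hypothetical distances for two objects and p = 5 variables.
delta = adjusted_distance_change([10.0, 10.0], [4.0, 4.0], p=5)
```

A positive value means that the average distance per variable decreases once the
variable is removed.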

3. Simulation Study 

A small Monte Carlo experiment is performed in order to assess the behaviour of
our standardized deletion statistics $z_k^i$ and the adequacy of the threshold
values $z^*$ to be used for the identification of influential observations. In
particular, we repeatedly generate independent realizations from an eight-variate
normal distribution through a slightly modified version of the algorithm in
Jolliffe et al. (1995). In the present study, 1,000 data sets are simulated under
the null hypothesis of no outliers, and a further 1,000 data sets under the
alternative hypothesis of isolated contamination.

In each of the outlier-free simulated data sets, there are 80 observations
generated to fall into eight clusters of equal size. Clusters are well separated and
have the same dispersion. All cluster centroids lie within the unit hypercube.
A $k$-clusters partition is then obtained for each data set through the convergent
k-means algorithm, for several values of $k$. The clustering algorithm is started
from carefully selected seed points. For this purpose, we suggest an extension
of the seed selection technique adopted in the FASTCLUS procedure of SAS
(1990). Our method, which is detailed in the Appendix, is motivated by the large
effect that the order of the observations in the data set may have on results for
single-case deletion diagnostics when the standard algorithm is applied.

Figure 1: Estimated Type-I error rates under the null hypothesis of no outliers,
for the test based on $z^* = 3$ and for different values of $k$. FASTCLUS seed
selection algorithm. [Plot not reproduced; horizontal axis: unit.]

The disagreement measure is defined as $\theta_k^i = 1 - R_k^i$, where $R_k^i$ is the
corrected-for-chance Rand index for comparing partitions $P_k$ and $P_k^i$. Therefore
$\theta_k^i = 0$ if a perfect match exists between $P_k$ and $P_k^i$, while values of
$\theta_k^i$ near 1 show a chance-level agreement between the two partitions.
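
For concreteness, the corrected-for-chance Rand index of Hubert and Arabie (1985)
and the resulting disagreement measure can be sketched in plain Python. The
function names are hypothetical, and returning 1 when both partitions are
trivial is a convention assumed here, not something stated in the text.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Corrected-for-chance Rand index between two labelings of the same objects."""
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two labelings of the same n >= 2 objects")
    n = len(labels_a)
    # Pairs of objects placed together in both, in the first, in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # assumed convention for two trivial partitions
    return (sum_cells - expected) / (max_index - expected)


def disagreement(reference, deleted, i):
    """1 - R, after removing unit i from the n-object reference partition.

    deleted holds the n - 1 labels obtained by re-clustering without unit i.
    """
    reference = list(reference)
    reduced_reference = reference[:i] + reference[i + 1:]
    return 1.0 - adjusted_rand_index(reduced_reference, list(deleted))


# Hypothetical example: deleting unit 2 leaves the grouping of the others unchanged.
theta = disagreement([0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0], i=2)
```

Cluster labels only matter up to a permutation, so the relabelled partition in the
example still gives a perfect match.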

For each data set, cross-validation standardized values $z_k^i$ are computed as in
(1) and then compared with several thresholds. Given a specific cutoff $z^*$, the
Type-I error rate for unit $i$ in the $k$-clusters partition is estimated as the
proportion of simulations in which $z_k^i$ exceeds $z^*$. Figure 1 displays estimated
error rates under the null hypothesis of no outliers in the data, for $z^* = 3$ and a
few values of $k$, when the FASTCLUS seed selection technique is adopted. Results
from the standard procedure are clearly not satisfactory, due to the decreasing
trend in all estimated Type-I error rates. On the contrary, as Figure 2 shows, our
method proves to be largely insensitive to the observation order for all reported
values of $k$.
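
The Monte Carlo estimate described above is a simple proportion. A minimal Python
sketch, with a hypothetical function name and made-up standardized values, is:

```python
def estimated_error_rates(z_by_simulation, cutoff):
    """Per-unit proportion of simulations in which z exceeds the cutoff.

    z_by_simulation[s][i] is the standardized value of unit i in simulation s,
    for a fixed number of clusters.
    """
    n_sim = len(z_by_simulation)
    n_units = len(z_by_simulation[0])
    return [
        sum(1 for sim in z_by_simulation if sim[i] > cutoff) / n_sim
        for i in range(n_units)
    ]


# Hypothetical values: 4 simulations, 2 units.
z_values = [[0.1, 3.5], [2.5, 3.2], [0.0, 1.0], [-1.0, 3.1]]
rates_3 = estimated_error_rates(z_values, cutoff=3.0)
rates_2 = estimated_error_rates(z_values, cutoff=2.0)
average_rate_2 = sum(rates_2) / len(rates_2)
```

Averaging the per-unit rates gives a single summary figure for each cutoff and
each value of $k$.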

For our seed selection procedure, Monte Carlo estimates of average error
rates under the null hypothesis are given in Table 1 for both cutoffs $z^* = 2$ and
$z^* = 3$, and for $6 \le k \le 11$. With well-separated clusters, all (estimated)
average rates for the test based on $z^* = 2$ are less than 7.5%, while the
corresponding rates for the higher threshold do not exceed 3.0%. Indeed, average
error rates of these tests are as small as 1.1% and 0.6%, respectively, when $k$ is
set equal to the true number of clusters. Thus we conclude that the diagnostic
technique proposed in the present paper does not seem to produce spuriously large
numbers of outliers with well-behaved clustered data. Furthermore, computation of
cross-validation standardized statistics can convey useful information also on the
related problem of choosing the appropriate number of clusters.

Figure 2: Estimated Type-I error rates under the null hypothesis of no outliers,
for the test based on $z^* = 3$ and for different values of $k$. Seed selection
algorithm as in the Appendix. [Plot not reproduced; horizontal axis: unit.]

Table 1: Average error rates for cutoffs $z^* = 2$ and $z^* = 3$, and for different
values of $k$. Seed selection algorithm as in the Appendix.

k            6       7       8       9       10      11
z* = 2     0.056   0.048   0.011   0.058   0.073   0.074
z* = 3     0.023   0.019   0.006   0.029   0.028   0.024

To assess the performance of our deletion statistics under the alternative
hypothesis of isolated contamination, 1,000 additional data sets are simulated by
adding one outlier to 80 well-behaved clustered observations (generated as before).
In all contaminated data sets, the outlier is generated to fall near (or slightly
outside) the upper boundaries of the unit hypercube. Figure 3 displays boxplots of
estimated Type-I error rates for the test based on $z^* = 3$. For ease of
presentation, we restrict to $7 \le k \le 10$. The outlying unit is clearly revealed
irrespective of the chosen number of clusters, as it exhibits a disproportionately
large error rate for all values of $k$. Also note that in this case the distribution
of estimated error rates correctly supports the existence of nine clusters with one
outlier.

4. Example

As an expository example, we classify the municipalities (comuni) in the province
of Parma, Italy, according to 5 demographic indicators taken from the 1991 Italian
Census. In the present example $n = 47$. A $k$-clusters partition is obtained
through the convergent k-means algorithm, with the seed selection procedure
described in the Appendix. The clustering process is repeated for several values of
$k$. For simplicity, we restrict the analysis to $3 \le k \le 8$, which is the range
of practical interest in this application.
Figure 3: Boxplots of estimated Type-I error rates under the alternative hypothesis
of isolated contamination, for the test based on $z^* = 3$ and for different
values of $k$. Seed selection algorithm as in the Appendix. [Plot not reproduced;
horizontal axis: number of clusters (7, 8, 9, 10).]



At each step, the disagreement measure is again defined as $\theta_k^i = 1 - R_k^i$.
Cross-validation standardized values $z_k^i$ are then computed as in (1). The
resulting stalactite plot is given in Figure 4, where columns represent units and rows
correspond to values of $k$. For ease of presentation, we only display units for
which at least one $z_k^i > 2$. Note that the actual definition of $\theta_k^i$ is not
crucial, as very similar results are reached with a number of alternative choices for
$\theta_k^i$. This appears to be an advantage of standardization over the use of raw
disagreement measures, which can lead to discordant information using different
criteria (Jolliffe et al., 1995).

The plot in Figure 4 shows the influential nature of observation 5 across several
clustering solutions. On closer inspection of the data, it can be seen that this
unit is an outlier on the first variable, measuring the ratio between the number of
elders over 65 and the number of boys aged 14 or less. It is also apparent from
Figure 4 that individual influence can vary markedly as the number of clusters
changes, and other municipalities can be occasionally influential for some values
of $k$.

Results concerning the identification of influential variables are given in Table 2,
where we report the values of $\Delta_k^j$ for all variables and $3 \le k \le 8$. The
figures in Table 2 clearly reveal the large influence of the first variable. In
particular, it is seen that deletion of this variable leads to a considerable reduction
in average (adjusted) Mahalanobis distance for all values of $k$ in the range of
interest. Therefore, clusters become more compact after removing Variable 1. For
$k \ge 6$, deletion of Variable 3 also provides a slight improvement in the computed
classification.
Figure 4: Example. Stalactite plot of standardized Rand index. Columns refer to
units and rows to values of $k$. ** denotes $z_k^i > 3$, * denotes $z_k^i > 2$.
Table 2: Example. Average change in adjusted Mahalanobis distance $\Delta_k^j$.

k    Var. 1   Var. 2   Var. 3   Var. 4   Var. 5
3     5.57    -2.00    -0.93    -2.09    -2.10
4     3.18    -1.32    -0.39    -1.42    -1.42
5     3.07    -1.16    -0.32    -1.26    -1.26
6     2.39    -0.91     0.19    -1.05    -1.05
7     2.15    -0.61     0.12    -0.96    -0.96
8     1.83    -0.69     0.51    -0.84    -0.84


5. Discussion 

In this paper we propose simple diagnostic tools for the purpose of detecting 
influential units and variables in nonhierarchical cluster analysis across different 
clustering solutions. The effectiveness of the suggested method is illustrated 
both by means of real and simulated data sets. Furthermore, we show how to 
overcome the possible effect of the order of the observations on results from the 
k-means algorithm.

However, a well known problem with single-case deletion diagnostics is that 
they can suffer from the problems of masking and swamping when multiple out- 
liers are present in the data (Atkinson, 1994). This is true also for quantities like 
(1) and (2). Therefore, further research must be devoted towards the definition 
of truly robust methods for the detection of masked multiple outliers in cluster 
analysis. 

Appendix: Seed Selection Algorithm for Nonhierarchical Clustering

Step 0. Fix the number of clusters, say $k$.

Step 1. Scan the data in natural order. Compute $k$ preliminary seeds as in the
FASTCLUS procedure of SAS (1990). Let $S_1$ be the set of indexes of the
observations selected as preliminary seeds at this step.

Step 2. Scan the data in reverse order. Compute $k$ preliminary seeds as in the
FASTCLUS procedure of SAS (1990). Let $S_2$ be the set of indexes of the
observations selected as preliminary seeds at this step.

Step 3. Define $S_3 = S_1 \cap S_2$. Let $k'$ be the cardinality of $S_3$. Take the
observations indexed in $S_3$ as initial seeds for cluster analysis. Stop if $k' = k$;
otherwise put $j = k'$ and go to Step 4.

Step 4. Increase $j$ by 1. Let $S_1'$ and $S_2'$ be the sets of indexes of preliminary
seeds not already selected in $S_1$ and $S_2$, respectively. Define
$S' = S_1' \cup S_2'$, and $S = (S_1 \cup S_2) - S'$. Choose the unit in $S'$ which has
the largest distance from the nearest seed in $S$ as the $j$-th initial seed for
cluster analysis. Iterate this step until $j = k$.
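
Steps 3 and 4 can be sketched in Python as follows. This is a minimal sketch under
stated assumptions: the two preliminary seed sets are taken as given (the FASTCLUS
scan itself is not reproduced), Euclidean distance is used, and the rule applied
when the two seed sets share no index is an assumption, since the text does not
cover that case. All names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_seeds(data, s1, s2, k, dist=euclidean):
    """Combine forward-scan seeds s1 and reverse-scan seeds s2 into k seeds.

    s1 and s2 are lists of row indexes of data. Seeds common to both scans
    are kept (Step 3); the rest are added one at a time, each time taking the
    candidate farthest from its nearest already selected seed (Step 4).
    """
    selected = sorted(set(s1) & set(s2))
    candidates = (set(s1) | set(s2)) - set(selected)
    while len(selected) < k and candidates:
        if selected:
            best = max(
                sorted(candidates),
                key=lambda c: min(dist(data[c], data[s]) for s in selected),
            )
        else:
            # Assumption: with no common seed, start from the lowest index.
            best = min(candidates)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical two-dimensional data and preliminary seed sets.
points = [(0, 0), (10, 0), (0, 10), (1, 1), (9, 1)]
seeds = merge_seeds(points, s1=[0, 1, 2], s2=[2, 4, 3], k=3)
```

Because the result depends only on the two index sets and on the distances, it is
the same whichever of the two scans is labelled as the forward one.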
Abstract: In multiple time series analysis, when there are a very large number
of series, a classification into homogeneous clusters might be useful to reduce
the problem's complexity and eliminate possible redundancies (Zani, 1983).
Furthermore, when we have different classifications, one for each statistical unit
(e.g. spatial units), a consensus classification provides a single
classification which summarizes the given classifications. The present paper
focuses on the problem of identifying consensus classifications in a set of 
multiple time series (panel data), using a consensus method (Vichi, 1993, 
1994). First, a distance among time series is defined and a hierarchical 
classification among time series, for each temporal lag and for each unit, is 
performed. Then, a consensus classification among different units for the same 
temporal lag is carried out. Finally, a hierarchical classification among the 
different consensus classifications, with the same temporal lag, is carried out. 

Keywords: Distance between Time Series, Hierarchical Clustering of Time 
Series, Consensus Classification. 



1. Introduction 

Consensus analysis applied to time series allows us to summarize the 
information about the dynamic structure of phenomena concerning different 
units. This implies that the data structure associated with these phenomena is a
three-way array. In the present paper we suggest an original procedure based on
some new tools of multivariate statistics, generally applied to cross-sectional
data sets. Different approaches to and applications of time series classification
are present in the literature (Bohte et al., 1980; Piccolo, 1990; Zani, 1983). In
particular, the approach by Bohte et al. (1980), which takes into account the 
dimension of time through specific functions of time series (autocorrelation and 
cross-correlation), has been considered in this paper. Then, the consensus 
classification proposed by Vichi (1993, 1994) is briefly described (section 3) 
and we suggest applying the consensus method, usually utilized for cross- 
sectional data sets, to stationary time series (section 4). The suggested 
procedure allows us to obtain an informative synthesis of structural and 
dynamic changes of a phenomenon measured in different units. 
2. Clustering of a Multiple Time Series

In the literature on clustering time series, among the different approaches
mentioned above, we follow the procedure suggested by Bohte et al. (1980),
based on different distances between time series. Let $X = [x_{jt}]$ be a
multiple time series, where $x_{jt}$ denotes the observed value at time $t$ of the
$j$-th time series, $j = 1, \dots, J$. Assume that the time series are stationary.

In order to analyze the proximity relationships between a pair of stationary time
series $x_j$ and $x_m$, $j \ne m$, considering the temporal lag, the dissimilarity
coefficient with lag $k$ is utilized:

$$d_{jm}(k) = \frac{|r_{jj}(k)\, r_{mm}(k) - r_{jm}(k)\, r_{mj}(k)|}{1 + |r_{jj}(k)\, r_{mm}(k) - r_{jm}(k)\, r_{mj}(k)|}, \qquad k \ge 0, \qquad (1)$$

where $r_{jj}(k)$ is the autocorrelation coefficient of time series $x_j$ at lag $k$
and $r_{jm}(k)$ is the cross-correlation coefficient between the time series $x_j$ and
$x_m$ at lag $k$. It is easy to prove that (1) satisfies the axioms
$d(x_j, x_m) \ge 0$; $d(x_j, x_m) = d(x_m, x_j)$; $x_j = x_m \Rightarrow d(x_j, x_m) = 0$,
so that (1) can be defined as a dissimilarity coefficient but not as a distance
function. After the dissimilarities $d_{jm}$, $j \ne m = 1, \dots, J$, between pairs of
time series have been defined, it
is possible to identify groups of time series which show links in their temporal
trend, with possible lag. In order to identify groups of series, which show a high 
similarity in their temporal trend, a hierarchical classification method can be 
used, as proposed by Zani (1983). As a basis for the clustering of time series the 
single linkage method has been chosen. It is well known that in this case only 
the above axioms, a subset of the full set of axioms satisfied by a distance 
function, have to be satisfied by the dissimilarity coefficient chosen for the 
analysis. This clustering of time series is different from a usual classification for 
the presence of the temporal lag which marks the linkage among the groups 
already obtained. 
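
The ultrametric matrix produced by single linkage can be obtained directly from a
dissimilarity matrix, since it coincides with the minimax path dissimilarity (the
largest step on the best chain linking two objects). The Python sketch below
illustrates this on a hypothetical 3 x 3 matrix; the function name is not from the
original paper, and the input is assumed to be symmetric with a zero diagonal.

```python
def single_linkage_ultrametric(d):
    """Ultrametric matrix of the single linkage hierarchy built on d.

    d is a symmetric dissimilarity matrix (list of lists) with zero diagonal.
    Entry [a][b] of the result is the level at which a and b are first merged.
    """
    n = len(d)
    u = [row[:] for row in d]
    # Floyd-Warshall style update on the (min, max) algebra.
    for h in range(n):
        for a in range(n):
            for b in range(n):
                via_h = max(u[a][h], u[h][b])
                if via_h < u[a][b]:
                    u[a][b] = via_h
    return u


# Hypothetical dissimilarities among three time series.
d_example = [
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
]
u_example = single_linkage_ultrametric(d_example)
```

In the example the first and third series are linked through the second one, so
their ultrametric distance drops from 0.9 to 0.5.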



3. Consensus Classification 

Methods for determining a consensus classification allow us to synthesize
several different hierarchical classifications into a single one which captures the
information in the given classifications. Among the approaches proposed in the
literature, we consider the Average Consensus or Least Squares Consensus
Dendrogram, proposed by Vichi (1993, 1994). This method achieves the
consensus classification by solving the following quadratic problem:

$$\min \sum_{i=1}^{I} \| U_i - U^* \|^2 \qquad (2)$$

over $U^* = [u^*_{jm}]$ such that $u^*_{jj} = 0$ and under the constraint that $U^*$ must
be an ultrametric matrix. $U_i$ is the ultrametric matrix associated with the $i$-th
unit and $U^*$ is the closest least squares matrix subject to the ultrametric
conditions (ultrametric consensus matrix) (Simeone & Vichi, 1996). In order to achieve
a consensus classification for a set of multiple time series we consider the method
proposed by Vichi (1993, 1994).
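
The two ingredients of problem (2), the ultrametric constraint and the least
squares objective, can be checked numerically with a few lines of Python. This is
only a sketch: it does not implement the consensus algorithm of Vichi (1993,
1994), it assumes that the norm in (2) is the Frobenius norm, and all names and
matrices are hypothetical.

```python
from itertools import combinations


def is_ultrametric(u, tol=1e-12):
    """Three-point condition: in every triple the two largest values coincide."""
    for a, b, c in combinations(range(len(u)), 3):
        low, mid, high = sorted((u[a][b], u[a][c], u[b][c]))
        if high - mid > tol:
            return False
    return True


def elementwise_mean(matrices):
    """Entry-by-entry average of a list of equally sized square matrices."""
    n = len(matrices[0])
    count = len(matrices)
    return [[sum(m[a][b] for m in matrices) / count for b in range(n)]
            for a in range(n)]


def least_squares_loss(unit_matrices, candidate):
    """Sum over units of the squared Frobenius distance from the candidate."""
    n = len(candidate)
    return sum((u[a][b] - candidate[a][b]) ** 2
               for u in unit_matrices for a in range(n) for b in range(n))


# Two hypothetical unit ultrametrics sharing the same tree shape.
u1 = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
u2 = [[0, 1, 4], [1, 0, 4], [4, 4, 0]]
candidate = elementwise_mean([u1, u2])
loss = least_squares_loss([u1, u2], candidate)
```

The elementwise mean minimizes the objective when the constraint is ignored, so
whenever it happens to be ultrametric, as in this example, it also solves the
constrained problem. In general it is not ultrametric and a dedicated algorithm is
needed.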



4. Classification of Multiple Time Series Observed in Several 
Units 

Suppose we have to classify a set of time series referring to the same
phenomenon but measured in different units $i = 1, \dots, I$. Thus, the collected data
set may be organized as a three-way array:

$$X = [X_1, \dots, X_i, \dots, X_I], \qquad (3)$$

where $X_i$ is the multiple time series observed in the $i$-th unit.

The procedure proposed may be formalized as follows.

1. First, a hierarchical clustering method (single linkage) has to be applied to the
multiple time series $X_i$ for each unit $i$ ($i = 1, \dots, I$) and for each temporal
lag $k$ ($k = 0, \dots, K - 1$). In particular, for each unit $i$, a classification of
the time series' elements, associated with the same temporal lag, must be determined.
Thus, for each unit $i$, we have $K$ hierarchical classifications and then $K$
ultrametric matrices. Referring to the $i$-th unit we have the following $K$
ultrametric matrices: $U_i(0), \dots, U_i(k), \dots, U_i(K - 1)$, for a total number of
$I \times K$ ultrametric matrices. The above matrices have been obtained using a
generalization of (1) to $I$ units:

$$d_{ijm}(k) = \frac{|r_{ijj}(k)\, r_{imm}(k) - r_{ijm}(k)\, r_{imj}(k)|}{1 + |r_{ijj}(k)\, r_{imm}(k) - r_{ijm}(k)\, r_{imj}(k)|}, \qquad k \ge 0, \quad i = 1, \dots, I, \qquad (4)$$

where $r_{ijj}(k)$ is the autocorrelation coefficient of time series $x_j$ at lag
$k$, referring to the $i$-th unit, while $r_{ijm}(k)$ is the cross-correlation
coefficient between the time series $x_j$ and $x_m$, $j \ne m = 1, \dots, J$, at lag $k$,
referring to the same unit.

2. Regarding the same temporal lag, a least squares consensus classification
(Section 3) among the ultrametrics associated with the different units must be
determined. Thus, we have $K$ ultrametric consensus matrices, one for each
temporal lag:

$$U^*(0), \dots, U^*(k), \dots, U^*(K - 1), \qquad (5)$$

obtained by solving (2), under the same constraint, for each lag $k$
($k = 0, \dots, K - 1$):

$$\min \sum_{i=1}^{I} \| U_i(k) - U^*(k) \|^2. \qquad (6)$$

3. This part can be formalized as follows.

3.1 The different ultrametric matrices (5) are summarized in a three-way
matrix:

$$U^* = [U^*(0), \dots, U^*(k), \dots, U^*(K - 1)]', \qquad U^*(k) = [u^*_{jm}(k)], \qquad (7)$$

and $u^*_{jm}(k)$ is the ultrametric between $x_j$ and $x_m$ calculated for each lag $k$.

3.2 In $U^*$ we detect $d^*_{jm}(k^*) = \inf_k u^*_{jm}(k)$, where $k^*$ is the temporal lag
referring to the minimum distance between $x_j$ and $x_m$. Then, we have a
dissimilarity matrix $D^*(k^*) = [d^*_{jm}(k^*)]$ obtained by considering the minimum
distance among the same pairs of time series for the temporal lag $k^*$.

3.3 We apply a hierarchical clustering algorithm (single linkage) to $D^*(k^*)$.
Finally, we have a single hierarchical classification $U^{**}(k^*) = [u^{**}_{jm}(k^*)]$
characterized by the association into homogeneous groups of multiple time
series referring to different units. This classification takes into account the
distance among the pairs of time series (or groups of time series), as well as the
temporal lag $k$ at which the series have been associated.

The ultrametric matrix $U^{**}(k^*)$ is in bijection with a "labeled" dendrogram in
which one can find, for each association step, the temporal lag $k^*$.
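
Step 3.2 amounts to an entry-by-entry minimum over the lag-specific consensus
matrices, keeping track of the lag at which the minimum occurs. A minimal Python
sketch follows; the function name and the matrices are hypothetical, and breaking
ties in favour of the smallest lag is an assumption not stated in the text.

```python
def min_over_lags(consensus_by_lag):
    """Return the matrix of minimum distances and the matrix of lag labels.

    consensus_by_lag[k][j][m] is the consensus ultrametric between series j
    and m at lag k. Ties are resolved in favour of the smallest lag (assumed).
    """
    n = len(consensus_by_lag[0])
    d_star = [[0.0] * n for _ in range(n)]
    k_star = [[None] * n for _ in range(n)]
    for j in range(n):
        for m in range(n):
            if j == m:
                continue
            values = [u[j][m] for u in consensus_by_lag]
            best_lag = min(range(len(values)), key=values.__getitem__)
            d_star[j][m] = values[best_lag]
            k_star[j][m] = best_lag
    return d_star, k_star


# Hypothetical consensus ultrametrics for two series at lags 0 and 1.
lag0 = [[0.0, 0.5], [0.5, 0.0]]
lag1 = [[0.0, 0.3], [0.3, 0.0]]
d_min, lag_labels = min_over_lags([lag0, lag1])
```

The matrix of lag labels is what allows each merge in the final dendrogram to be
annotated with a temporal lag.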

The obtained classification allows us to consider the information about a data 
set referring to different units, as well as the evolutive behavior of the studied 
phenomenon. This dynamic behavior is measured by the temporal lag at which 
the time series have been clustered. With the proposed procedure it is thus 
possible to analyze complex phenomena considering both structural and 
dynamic aspects. 
5. An Example of the Proposed Procedure

In order to illustrate the procedure described in this paper, we will consider a 
three way data set regarding the economic performance of the most 
industrialized countries. 



5.1. Time Series Classification Referring to Countries for Each Temporal 
Lag 

The three-way data set is made up of 7 countries x 8 economic variables x 27
years, from 1970 to 1996. The 7 countries considered are the G7 countries:
Canada, France, Germany, Great Britain, Italy, Japan and USA. The 8 economic
variables (source: Datastream) are: Gross National Product per capita in US
dollar terms ($v_1$), Gross Domestic Product at price and exchange rate of 1985
($v_2$), Unemployment Rate ($v_3$), Consumer Price Index ($v_4$), Exports in US dollar
terms ($v_5$), Imports in US dollar terms ($v_6$), Industrial Production Index ($v_7$),
Current Account Balance in US dollar terms ($v_8$). In order to equalize the size
and the variability of the input variables, a standardization procedure has been
performed, as suggested by Milligan & Cooper (1988). Furthermore, the time
series, already standardized, have been made stationary by taking the Vth
differences (Box & Jenkins, 1976). Considering the dissimilarity coefficient
$d_{ijm}(k)$, $i = 1, \dots, 7$; $j \ne m = 1, \dots, 8$; $k = 0, \dots, 4$, 35 dissimilarity
matrices between time series have been calculated. The 35 matrices refer to the 7
countries multiplied by the 5 temporal lags. The single linkage clustering
algorithm has been applied in order to classify the 8 time series for each country
and for each temporal lag. Thus, we have 35 ultrametric matrices and hence 35
dendrograms.
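
The autocorrelations and cross-correlations that enter the dissimilarity
coefficient have to be estimated from the observed series. The Python sketch below
uses a common sample estimator; the paper does not state which estimator was used,
so the pairing of $x_t$ with $y_{t+k}$ and the normalization by the full-sample
sums of squares are assumptions, and the function name is hypothetical.

```python
import math


def cross_correlation(x, y, k):
    """Sample cross-correlation between x at time t and y at time t + k.

    Both series must have the same length T and 0 <= k < T. The
    autocorrelation of x at lag k is cross_correlation(x, x, k).
    """
    t = len(x)
    if len(y) != t or not 0 <= k < t:
        raise ValueError("series must have equal length and 0 <= k < length")
    mean_x = sum(x) / t
    mean_y = sum(y) / t
    cov = sum((x[s] - mean_x) * (y[s + k] - mean_y) for s in range(t - k))
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("a constant series has no correlation")
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical short series.
series = [1.0, 2.0, 3.0, 4.0]
r0 = cross_correlation(series, series, 0)
r1 = cross_correlation(series, series, 1)
```

With this estimator the lag-0 autocorrelation is 1 by construction, and for
$k > 0$ the value generally depends on which of the two series is shifted.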



5.2. Consensus Classification for each Temporal Lag 

For each temporal lag, a least squares consensus classification among the
corresponding ultrametrics has been computed, so that the 35 ultrametrics yield 5
synthesized classifications. Specifically, each consensus matrix has been obtained
by synthesizing the dissimilarity matrices referring to the different G7 countries
at the same temporal lag. The ultrametric consensus matrices and the consensus
dendrograms at lags $k = 0$, $k = 1$, $k = 2$, $k = 3$ and $k = 4$ are shown in Figure 1.
As shown in the figure, for the lags $k = 0, 1, 3, 4$, two groups may be identified
at small dissimilarity levels. In the first one, the Gross
Domestic Product ($v_2$) is amalgamated with the Industrial Production Index ($v_7$).
In the second one, the Exports ($v_5$) and the Imports ($v_6$) are joined together. The
other variables are aggregated at higher dissimilarity levels. At the lag $k = 2$
one can find again two groups: the first one is the same ($v_2$, $v_7$), but the second
one is different: the Unemployment Rate is connected with the Imports ($v_3$, $v_6$).
5.3. Final Classification

Considering the 5 consensus ultrametrics, the minimum distance between each
pair of time series, $d^{*}_{jm}(k^{*}) = \inf_{k} \{ d_{jm}(k) : k = 0, 1, 2, 3, 4 \}$,
has been calculated, where $k^{*}$ is the temporal lag at which the minimum
distance between the pair of time series is attained. Thus we have a
dissimilarity matrix $D^{*}(k^{*})$.

Finally, the single linkage clustering method has been applied to
$D^{*}(k^{*})$. A final classification has thus been obtained. This
classification allows us to identify the temporal lag at which the variables
are amalgamated. The ultrametric matrix and the associated dendrogram,
referring to this final classification, are displayed in Table 1 and in
Figure 2, respectively.
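The construction of the minimum-distance matrix and of the associated lags can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout and the toy matrices are hypothetical, and ties between lags are resolved in favour of the smallest lag, a choice the text does not specify.

```python
def min_over_lags(consensus):
    """consensus maps each lag k to a symmetric matrix (list of lists).

    Returns (d_star, k_star): the entrywise minimum distance over the lags
    and the lag at which that minimum is attained (smallest lag on ties).
    """
    lags = sorted(consensus)
    n = len(consensus[lags[0]])
    d_star = [[0.0] * n for _ in range(n)]
    k_star = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            best = min(lags, key=lambda k: consensus[k][i][j])
            k_star[i][j] = best
            d_star[i][j] = consensus[best][i][j]
    return d_star, k_star


# Hypothetical consensus ultrametrics for 3 series at lags 0 and 1.
consensus = {
    0: [[0.0, 0.6, 0.9], [0.6, 0.0, 0.9], [0.9, 0.9, 0.0]],
    1: [[0.0, 0.8, 0.4], [0.8, 0.0, 0.8], [0.4, 0.8, 0.0]],
}
d_star, k_star = min_over_lags(consensus)
```

The entrywise minimum of several ultrametrics need not be ultrametric itself (it is not in this toy case), which is consistent with the text applying single linkage once more to the resulting matrix.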



Table 1: Final Labeled Ultrametric Matrix (each distance value is multiplied
by 100)

      v1       v2       v3       v4       v5       v6       v7       v8
v1    0        0.75(2)  0.75(2)  0.78(4)  0.75(2)  0.75(2)  0.75(2)  0.75(2)
v2             0        0.71(3)  0.78(4)  0.71(3)  0.71(3)  0.38(3)  0.73(4)
v3                      0        0.78(4)  0.65(2)  0.65(2)  0.71(3)  0.73(4)
v4                               0        0.78(4)  0.78(4)  0.78(4)  0.78(4)
v5                                        0        0.44(3)  0.71(3)  0.73(4)
v6                                                 0        0.71(3)  0.73(4)
v7                                                          0        0.73(4)
v8                                                                   0

The temporal lag k is given in parentheses.



Figure 2: Final Labeled Dendrogram (each distance value is multiplied by 100)

[Dendrogram; leaf order: v1, v2, v7, v3, v5, v6, v8, v4; horizontal axis:
dissimilarity level, from 0.3 to 0.8; each join is labeled with its lag k.]


As shown in Figure 2, the following results may be observed. 

The variables v2 and v7, and the variables v5 and v6, are amalgamated at lag 3,
even if the aggregation distance levels are different. Then, v3 is connected
with the group (v5, v6) at lag 2, the group (v2, v7) with (v3, v5, v6) at
lag 3, v8 with (v2, v7, v3, v5, v6) at lag 4, and v1 with
(v2, v7, v3, v5, v6, v8) at lag 2; finally, v4 is joined to
(v1, v2, v7, v3, v5, v6, v8) at lag 4.



6. Final Remarks 

Multiple time series classification has been used so far for series of data 
containing single units (Bohte et al., 1980; Piccolo, 1990; Zani, 1983). In the 
present paper, a generalization of time series classification to the situation in 
which there are several units is proposed. This original procedure has been 
obtained using the consensus method (Vichi, 1994). Furthermore, the suggested 
procedure allows us to consider the temporal lag at which the series are 
associated. Finally, an example of the proposed procedure has been presented. 
Abstract: In adaptive designs defined by selecting an initial simple random
sample with or without replacement, the sample mean estimator is unbiased
only if the initial sample is used, whereas it is biased when the sample
obtained at the end of the adaptive procedure is considered. In the latter
situation the estimator has been suitably modified (Thompson and Seber 1996).
However, for several estimators other than the mean, such as the variance,
the construction of a corresponding unbiased estimator in adaptive designs
has not been solved.

In this paper the BAGS (Bootstrap for Adaptive Cluster Sampling) procedure,
based on resampling, is proposed to estimate the bias of an estimator.

Keywords 

Adaptive Cluster Sampling, Resampling, Bootstrap, Bias 



1. Introduction 

In this paper we consider a sampling design where the n units are selected by
simple random sampling with replacement and where the probability of selecting
the i-th unit in a draw is known. It is well known that, under such a design,
unbiased estimators of the population parameters can be obtained. In
particular, for the population mean there is an unbiased estimator of the
Hansen-Hurwitz type, in which every $y_i$ observed on the sampled units is
divided by the corresponding selection probability and multiplied by the
number of times that the unit is selected. The Hansen-Hurwitz estimator is
unbiased but, in adaptive designs, it is not applicable because the selection
probabilities are not known for every unit of the sample.
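As a minimal sketch of the Hansen-Hurwitz estimator of the mean described above (not taken from the paper), each observed value is divided by its draw-by-draw selection probability, weighted by the number of times the unit was drawn, and the total is divided by nN. The function name, argument layout and toy data are illustrative assumptions.

```python
from collections import Counter


def hansen_hurwitz_mean(draws, y, p, N):
    """Hansen-Hurwitz estimate of the population mean.

    draws : labels of the units selected in the n draws (with replacement)
    y     : dict unit -> observed value
    p     : dict unit -> probability of selecting the unit in a single draw
    N     : population size
    """
    n = len(draws)
    counts = Counter(draws)  # number of times each distinct unit is selected
    total = sum(f * y[u] / p[u] for u, f in counts.items())
    return total / (n * N)


# Hypothetical population of 4 units with equal selection probabilities.
y = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 8.0}
p = {u: 0.25 for u in y}
estimate = hansen_hurwitz_mean(["a", "c", "c"], y, p, N=4)
```

With equal probabilities 1/N the estimate reduces to the ordinary mean of the drawn values, which gives a quick sanity check of the sketch.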

Using networks, Thompson (1990) proposed a modified Hansen-Hurwitz estimator
of the mean that remains unbiased when it is applied to the final sample.



*This work, though it was the result of a close collaboration of the two authors, has been 
specifically elaborated as follows: the sections 1, 3 and 5 by T. Di Battista and the remainder by 
D. Di Spalatro. 
However, if $T_n$ is an estimator that differs from the sample mean, such as
the variance estimator, it may be difficult to construct a corresponding
unbiased adaptive one. Nevertheless, the use of biased estimators is in some
situations the only alternative. For example, if the statistical data have a
point pattern and the aim of the research is to study the spatial dispersion,
a function of the variance, such as the ratio between sample mean and
variance, is used. In this case it may be useful at least to estimate the
bias of the variance estimator obtained at the end of an adaptive procedure.
The aim of this paper is to give a useful procedure for estimating the bias
of an estimator $T_n$ that is applied at the end of the adaptive procedure.

In section 2 we propose a method based on a resampling technique, called
BAGS (Bootstrap for Adaptive Cluster Sampling). In this context, we consider
a sampling design where the n units are drawn with replacement and have the
same selection probability. In section 3 we prove the consistency of the
method; in section 4 we evaluate, through a simulation, the behaviour of the
method for small samples. Finally, section 5 gives a discussion and possible
future developments.



2. The BAGS method 



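The aim of this paper is therefore to estimate

$$\mathrm{bias}(T_n^{\Omega}) = E_F(T_n^{\Omega}) - \theta \qquad (1)$$

where $\Omega$ indicates that the estimator $T_n$ is computed on the final
sample produced by an adaptive procedure and $\theta$ is the population
parameter being estimated. In particular, since we work with a finite
population, (1) is given by

$$\mathrm{bias}(T_n^{\Omega}) = \frac{1}{M} \sum_{k=1}^{M} T_{n,k}^{\Omega} - \theta \qquad (2)$$

where M is the number of samples of size n that we may draw from a finite
population of size N.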

An estimate of (2) may be obtained with a resampling procedure
(Efron and Tibshirani, 1993): after drawing with replacement, from a
population composed of N units, a sample of size n,
$\mathbf{X} = (X_1, \dots, X_i, \dots, X_n)$, we deduce the empirical
distribution function $\hat{F}$ from the sample $\mathbf{X}$ and we calculate
the estimator

$$T_n = f(\mathbf{X}). \qquad (3)$$


Then, we draw from $\hat{F}$ a bootstrap sample

$$\mathbf{X}^{*} = (X_1^{*}, \dots, X_i^{*}, \dots, X_n^{*}). \qquad (4)$$

By applying the adaptive procedure, for each unit drawn we get additional
units, obtaining a bootstrap final sample of the type

$$\mathbf{X}^{*}_{\Omega} = (X^{*}_{11}, \dots, X^{*}_{1A_1}; \dots; X^{*}_{i1}, \dots, X^{*}_{iA_i}; \dots; X^{*}_{n1}, \dots, X^{*}_{nA_n}) \qquad (5)$$

where $A_i$ ($i = 1, \dots, n$) is the number of units that are aggregated to
the i-th initial unit, including the i-th unit itself.

Sample (5) includes every initial unit of the bootstrap initial sample and
the units added by the adaptive procedure. From the sample (5) we may obtain
the estimator

$$T_n^{*\Omega} = f(\mathbf{X}^{*}_{\Omega}). \qquad (6)$$


An estimate of $\mathrm{bias}(T_n^{\Omega})$ is given by

$$\widehat{\mathrm{bias}}(T_n^{\Omega}) = \left( \sum_{k=1}^{m} T_{n,k}^{*\Omega} / m \right) - T_n \qquad (7)$$

where m is the number of bootstrap samples that can be drawn from $\hat{F}$.

The main difficulty of this procedure is to compute the expression

$$\sum_{k=1}^{m} T_{n,k}^{*\Omega} / m. \qquad (8)$$

Indeed, this computation requires the use of all m bootstrap samples.

The problem is solved numerically by drawing a large number B < m of
bootstrap samples, from which we obtain
$T_{n,1}^{*\Omega}, \dots, T_{n,B}^{*\Omega}$ and hence the following
estimate of (7):

$$\widehat{\mathrm{bias}}_B(T_n^{\Omega}) = \left( \sum_{b=1}^{B} T_{n,b}^{*\Omega} / B \right) - T_n. \qquad (9)$$
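
The resampling steps of this section can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a population laid out on a rectangular grid, a north-south/east-west neighbourhood and an aggregation condition of the form y > c, as used later in the simulation section. All function and variable names are hypothetical.

```python
import random
from statistics import mean


def cluster_values(unit, y, rows, cols, c):
    """Values of the units aggregated to `unit` (itself included): the four
    neighbours of a unit are added whenever that unit satisfies y > c."""
    seen = {unit}
    stack = [unit]
    while stack:
        r, k = stack.pop()
        if y[(r, k)] > c:  # aggregation condition met: add the neighbours
            for nb in ((r - 1, k), (r + 1, k), (r, k - 1), (r, k + 1)):
                if 0 <= nb[0] < rows and 0 <= nb[1] < cols and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return [y[u] for u in seen]


def bags_bias(initial, y, rows, cols, c, estimator, B, rng):
    """Bootstrap bias estimate: mean of B replicate estimates computed on
    bootstrap final samples, minus the estimate on the initial sample."""
    t_n = estimator([y[u] for u in initial])           # estimator on initial sample
    replicates = []
    for _ in range(B):
        boot = [rng.choice(initial) for _ in initial]  # bootstrap initial sample
        final = [v for u in boot
                 for v in cluster_values(u, y, rows, cols, c)]  # bootstrap final sample
        replicates.append(estimator(final))
    return mean(replicates) - t_n


# Hypothetical 3 x 3 population with a single unit satisfying y > 1.
rows = cols = 3
y = {(r, k): 0 for r in range(rows) for k in range(cols)}
y[(1, 1)] = 5
initial = [(1, 1), (0, 0)]
estimate = bags_bias(initial, y, rows, cols, c=1, estimator=mean,
                     B=200, rng=random.Random(0))
```

When no sampled unit satisfies the condition, every cluster reduces to the unit itself and the sketch behaves like an ordinary bootstrap.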



3. Consistency of the BAGS method 



The BAGS method is consistent in the sense that, when the sample size
increases, the bias estimated by the BAGS method converges to the bias
generated by the adaptive design.

Let $\{\bar{X}_{iA_i}\}$ denote a sequence of random variables, for
$i = 1, 2, \dots$, and let $\{S_n\}$ be a sequence of random variables where
$S_n = \sum_{i=1}^{n} \bar{X}_{iA_i}$ ($n = 1, 2, \dots$), such that
$E(\bar{X}_{iA_i}) = \mu_{\Omega}$. Let $\{\bar{X}_{\Omega}\}$ denote the
sequence of sample means $\bar{X}_{\Omega} = S_n / n$, where
$\bar{X}_{iA_i}$ is the mean of the cluster obtained by applying the
adaptive procedure to the i-th unit of the population.
Thus we can write:



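If we can define a twice differentiable function of the estimator
$\bar{X}_{\Omega}$, $T_n^{\Omega} = g(\bar{X}_{\Omega})$, then applying the
delta method to the estimator $T_n^{\Omega}$ we can write (Pace and Salvan,
1996):

$$T_n^{\Omega} = g(\mu_{\Omega}) + g'(\mu_{\Omega}) (\bar{X}_{\Omega} - \mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) (\bar{X}_{\Omega} - \mu_{\Omega})^2 + o_p\big((\bar{X}_{\Omega} - \mu_{\Omega})^2\big).$$

As is known, $(\bar{X}_{\Omega} - \mu_{\Omega}) = O_p(n^{-1/2})$, so

$$T_n^{\Omega} = g(\mu_{\Omega}) + g'(\mu_{\Omega}) (\bar{X}_{\Omega} - \mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) (\bar{X}_{\Omega} - \mu_{\Omega})^2 + O_p(n^{-3/2});$$

$$E(T_n^{\Omega}) = g(\mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) E[(\bar{X}_{\Omega} - \mu_{\Omega})^2] + O_p(n^{-3/2});$$

$$\mathrm{Bias}(T_n^{\Omega}) = g(\mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) E[(\bar{X}_{\Omega} - \mu_{\Omega})^2] + O_p(n^{-3/2}) - \theta.$$

Finally, let $\bar{X}_{\Omega}^{*}$ be the sample mean of a bootstrap sample
and $T_n^{*\Omega} = g(\bar{X}_{\Omega}^{*})$; then we have:

$$T_n^{*\Omega} = g(\mu_{\Omega}) + g'(\mu_{\Omega}) (\bar{X}_{\Omega}^{*} - \mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) (\bar{X}_{\Omega}^{*} - \mu_{\Omega})^2 + O_p(n^{-3/2});$$

$$E(T_n^{*\Omega}) = g(\mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) E[(\bar{X}_{\Omega}^{*} - \mu_{\Omega})^2] + O_p(n^{-3/2});$$

$$\mathrm{Bias}(T_n^{*\Omega}) = g(\mu_{\Omega}) + \tfrac{1}{2} g''(\mu_{\Omega}) E[(\bar{X}_{\Omega}^{*} - \mu_{\Omega})^2] + O_p(n^{-3/2}) - T_n.$$

Following Shao and Tu (1995), we have:

$$\mathrm{Bias}(T_n^{\Omega}) - \mathrm{Bias}(T_n^{*\Omega}) = \tfrac{1}{2} g''(\mu_{\Omega}) \left\{ E[(\bar{X}_{\Omega} - \mu_{\Omega})^2] - E[(\bar{X}_{\Omega}^{*} - \mu_{\Omega})^2] \right\} + T_n - \theta + O_p(n^{-3/2}). \qquad (10)$$

Therefore, the limit for $n \to \infty$ of equation (10) proves the
convergence and the consistency of the BAGS procedure.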

4. Some simulation results 

In order to evaluate the behaviour of the BAGS method we have simulated a 
finite population P of 100 units and we have considered a contiguity criterion 
north-south, east-west as shown in Figure 1. 



Figure 1 . The population P and the type of neighbourhood for unit i 
Applying the adaptive procedure, we have considered the aggregation condition
C of the type $y_i > c$ for c = 1, 2 (Thompson and Seber 1996).

In this context, we have drawn with replacement from the population P
samples of size n = 2, 3, 4, 5, 6. We have computed the real bias of the
sample mean and of the sample variance applied to the final sample, for each
sample size n and each condition C.

Then we have drawn B = 100 bootstrap samples from the initial samples and we
have estimated the bias of the sample mean and of the sample variance.




Figure 3. Real Bias and Estimated Bias of sample mean with c > 2.

Figure 4. Real Bias and Estimated Bias of sample variance with c > 1.

Figure 5. Real Bias and Estimated Bias of sample variance with c > 2.




The results shown in Figures 2, 3, 4 and 5 above confirm that the BAGS method
gives, even for small sample sizes, bias estimates close to the real bias of
the sample mean and of the sample variance, respectively.

5. Discussion 

The results obtained in sections 3 and 4 show that the BAGS method is able
to estimate the bias of the mean and variance estimators computed on the
final sample produced by the adaptive procedure.
In particular, in section 3 we proved the asymptotic convergence of the
method, while in section 4 we showed that the method gives a good estimate of
the bias even for small sample sizes.

If the estimator $T_n$ applied to the initial sample is itself biased, the
BAGS method gives information about the total bias of $T_n$ computed on the
final sample. Moreover, with small modifications the BAGS method makes it
possible to estimate accuracy measures of $T_n$ computed on the final sample
other than the bias, for example the standard error or the variance of $T_n$.

The BAGS method coincides with the traditional bootstrap method when the
units that belong to the drawn sample do not produce any additional units
through the adaptive procedure.
Abstract: This paper focuses on the problem of forecasting a classification
given a panel data set formed by a (multiple) time series of partitions of the
same set of units. As far as we know, in classification and time series
analysis this is a completely new problem. A methodology based on a vector
autoregressive model is proposed here to directly forecast a partition given a
(multiple) time series of partitions with the same fixed number of classes at
each time. Two real panel data sets have been analysed with this new
procedure. Open problems are discussed in a final section.

Key words: Cluster analysis, partitioning, panel data. 



1. Introduction 

Let $X \equiv \{x_{ijt} : i \in I, j \in J, t \in T\}$ be the three-way data
array, where $x_{ijt}$ is a real value of the j-th variable observed on the
i-th unit (object), according to the t-th situation, and
$I \equiv \{1, \dots, n\}$, $J \equiv \{1, \dots, k\}$ and
$T \equiv \{1, \dots, T\}$ are the sets of indices of modes i (units),
j (variables), and t (occasions), respectively. The most widely collected
three-way data set arises when units and variables remain the same and the
situations are different time points. Thus, units are followed by examining
the changes of the variables over a set of different time points. In
economics, data collected in this way are called a panel data set. Primary
survey aspects such as measurement errors, incomplete coverage of the
population of interest, non-responses and recall bias for panel data are not
discussed here; see Bailar (1989) for these aspects.

Here we suppose we have observed or computed a particular panel data set
formed by T partitions of the same set of objects I examined at T temporal
occasions.

A partition at time t may be specified by an (n x c) (units by classes) matrix
$P_t = [p_{ilt}]$, where $p_{ilt}$ may be: 1) $p_{ilt} \in \{0, 1\}$, with
$p_{ilt} = 1$ ($p_{ilt} = 0$) if object i belongs (does not belong) to class l
of the partition at time t; 2) $p_{ilt} \in [0, 1]$, denoting the value of the
membership function of object i to class l of the fuzzy partition at time t;
3) $p_{ilt} = d_{ilt}$, where $d_{ilt}$ is the dissimilarity between object i
and the centroid of class l of the partition at time t. In this paper we
suppose we have observed or computed the dissimilarities $d_{ilt}$.







Several types of statistical analyses may be performed on the set of classifications 
linked by time. Firstly, the classificatory information of the entire period of time, 
in which the classifications have been observed, may be summarized by a median 
consensus i.e., the best least-squares classification approximating the given set 
(see: Barthelemy and Monjardet 1981, for n-trees; Cucumel 1990, Vichi 1993, for 
dendrograms). However, a unique consensus does not give information on the 
evolution of classes of the partitions over time. Further, a single consensus may 
not be sufficient to synthesise the data when the period of time is long and 
therefore many changes may occur for many classifications. A solution to this
problem is to partition the set of classifications into homogeneous subsets
of classifications at contiguous periods of time and, simultaneously, to find
a consensus classification for each period (Gordon and Vichi 1997).

A completely new problem is to forecast a partition given the set of
partitions linked by time. In this paper we aim to define a new methodology
that gives an efficient answer to this problem.

The paper is organized as follows. Section 2 shows a procedure to obtain the
set of partitions from a three-way data set and gives the methodology to
forecast a partition. Section 3 analyzes two real panel data sets and shows
the forecasted partitions, computing confidence intervals for the forecasts.
A final discussion is given in Section 4.



2. Forecasting a partition of the units 

Let c denote the number of classes dividing the same n objects into disjoint
classes $C_{1t}, \dots, C_{lt}, \dots, C_{ct}$, for each time point t.
Therefore, the number of classes is implicitly assumed constant over time.
The choice of c is discussed in section 3.

The partition for each time t is defined by solving the following well-known
mathematical programming problem:

$$\min \sum_{i=1}^{n} \sum_{l=1}^{c} d_{ilt} \, w_{ilt} \qquad [P1]$$

subject to

$$\sum_{l=1}^{c} w_{ilt} = 1, \quad 1 \le i \le n,$$

$$w_{ilt} \in \{0, 1\}, \quad 1 \le i \le n, \; 1 \le l \le c,$$

where $d_{ilt}$ is the distance between object i and the centroid of class
$C_{lt}$ at time t. The distance $d_{ilt}$ is usually the squared Euclidean
norm on $\mathbb{R}^k$, i.e., $d_{ilt} = \| x_{i.t} - z_{lt} \|^2$, where
$x_{i.t} = (x_{i1t}, \dots, x_{ikt})'$ and
$z_{lt} = \left( \frac{1}{|C_{lt}|} \sum_{i \in C_{lt}} x_{i1t}, \dots, \frac{1}{|C_{lt}|} \sum_{i \in C_{lt}} x_{ikt} \right)'$.
A good heuristic solution to problem [P1], which is known to be NP-hard, is
given by applying the c-means clustering algorithm (Ball and Hall 1967,
MacQueen 1967). The initial partition into c clusters:

(i) can be randomly chosen for each time $t \in T$;

(ii) can be randomly chosen for t = 1 and set equal to the partition
achieved at time t-1 for $t \in T$ (t > 1).

In case (i), since classes change composition passing from time t-1 to time
t, and partitions at two times are obtained independently, it is necessary to
specify for each class at time t-1 which is the corresponding class at time
t. However, this correspondence is not necessary using (ii), since it is
automatically established by the clustering algorithm used.

Note that using (ii) it is implicitly supposed that the partition at time t
depends on the partition at time t-1 (*), and we believe this is a necessary
condition to forecast a partition. For this reason, in this paper we will use
procedure (ii).
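
Procedure (ii) can be sketched as follows. The sketch uses a batch (Lloyd-type) variant of c-means, which is an assumption made for brevity (MacQueen's original algorithm updates the centroids sequentially); the function names and the toy data are hypothetical.

```python
def c_means(points, labels, c, max_iter=100):
    """Batch c-means started from a given partition.

    points : list of tuples (one per object)
    labels : initial class index in range(c) for each object
    """
    labels = list(labels)
    for _ in range(max_iter):
        # Centroids of the current classes (an empty class gets no centroid).
        cents = {}
        for l in range(c):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                cents[l] = tuple(sum(col) / len(members) for col in zip(*members))
        # Reassign every object to the class with the closest centroid.
        new = [min(cents, key=lambda l: sum((a - b) ** 2 for a, b in zip(p, cents[l])))
               for p in points]
        if new == labels:
            break
        labels = new
    return labels


def partitions_over_time(series, first_labels, c):
    """series[t] holds the points at time t; the partition found at time t-1
    is used as the initial partition at time t, as in procedure (ii)."""
    out, labels = [], first_labels
    for pts in series:
        labels = c_means(pts, labels, c)
        out.append(labels)
    return out


# Hypothetical one-dimensional data for 4 objects at 2 time points.
series = [
    [(0.0,), (1.0,), (10.0,), (11.0,)],
    [(0.0,), (1.0,), (2.0,), (11.0,)],
]
parts = partitions_over_time(series, first_labels=[0, 1, 0, 1], c=2)
```

Because each run starts from the previous partition, class labels keep their meaning from one time point to the next, which is the correspondence property discussed above.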

If the distances between each object and the c centroids change monotonically
from time t-1 to t, the final partition at time t is equal to the final
partition at time t-1. This is because, under a monotone transformation of
the distances, each object has minimum distance from the same closest centroid
as at time t-1. Therefore, a "stationary partition" is obtained at time t if
$d_{ilt}$ changes monotonically from time t-1 to t. However, often this is not
the case, and therefore we need to know if and how object i changes class of
the partition over time. For this purpose we use a vector autoregressive
model.

Let $D_1 = [d_{11}, \dots, d_{1T}], \dots, D_n = [d_{n1}, \dots, d_{nT}]$
denote the n c-dimensional multiple time series, with
$d_{it} = (d_{i1t}, \dots, d_{ict})'$ defined by the classification process
described above. It is assumed that the c-dimensional multiple time series
are generated by stationary stable vector autoregressive processes of order
$p_i$, VAR($p_i$):

$$d_{it} = v_i + A_{i1} d_{i,t-1} + \dots + A_{ip_i} d_{i,t-p_i} + u_{it}, \quad i = 1, \dots, n; \; t = 1, \dots, T, \qquad (1)$$

where $v_i = (v_{i1}, \dots, v_{ic})'$ is the c-dimensional vector of
intercept terms, the $A_{im}$, $m = 1, \dots, p_i$, are square coefficient
matrices of order c and $u_{it}$ is a c-dimensional innovation process (white
noise), i.e., $E(u_{it}) = 0$, $E(u_{it} u_{it}') = \Sigma_u$,
$E(u_{it} u_{is}') = 0$ for $s \ne t$, where $\Sigma_u$ is assumed to be a
non-singular covariance matrix. Criteria for selecting the VAR order will be
considered in section 3.

The compact form of VAR($p_i$) is:

$$D_i = B_i Z_i + U_i \qquad (2)$$

where $Z_i = (Z_{i0}, \dots, Z_{i,T-1})$ with
$Z_{it} = (1, d_{it}', \dots, d_{i,t-p_i+1}')'$,
$B_i = (v_i, A_{i1}, \dots, A_{ip_i})$ and $U_i = (u_{i1}, \dots, u_{iT})$.

The vec form of the VAR model is:

$$\mathrm{vec}(D_i) = (Z_i' \otimes I_c) \, \mathrm{vec}(B_i) + \mathrm{vec}(U_i) \qquad (3)$$

where $I_c$ is the identity matrix of order c, vec(.) is the usual stacking
operator, and $L \otimes M$ is the Kronecker product of two matrices L and M.

(*) However, in general the solution of the clustering algorithm does not
depend on the initial partition.

The multivariate least squares estimator can be written as:

$$\hat{B}_i = (\hat{v}_i, \hat{A}_{i1}, \dots, \hat{A}_{ip_i}) = D_i Z_i' (Z_i Z_i')^{-1} \qquad (4)$$
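
For a single object and order 1, the least squares formula (4) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the first observation is used as the presample value, the small matrix helpers are written by hand to keep the block self-contained, and all names and the toy coefficients are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(a):
    """Gauss-Jordan inverse with partial pivoting; assumes a small,
    non-singular square matrix (no check for singularity is made)."""
    n = len(a)
    m = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        pv = m[col][col]
        m[col] = [x / pv for x in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [row[n:] for row in m]


def fit_var1(series):
    """series: observations d_1, ..., d_T (lists of length c).
    Returns (v, A) such that d_t is approximated by v + A d_{t-1}."""
    D = transpose(series[1:])                               # c x (T-1)
    Z = transpose([[1.0] + list(d) for d in series[:-1]])   # (c+1) x (T-1)
    Zt = transpose(Z)
    B = matmul(matmul(D, Zt), inverse(matmul(Z, Zt)))       # B = D Z' (Z Z')^{-1}
    return [row[0] for row in B], [row[1:] for row in B]


# Noise-free toy series generated from hypothetical coefficients.
v_true = [1.0, 0.5]
A_true = [[0.5, 0.1], [0.2, 0.3]]
series = [[0.0, 0.0]]
for _ in range(7):
    prev = series[-1]
    series.append([v_true[r] + sum(A_true[r][s] * prev[s] for s in range(2))
                   for r in range(2)])
v_hat, A_hat = fit_var1(series)
```

Since the toy series contains no noise, the estimates should reproduce the generating coefficients up to rounding error, which makes the sketch easy to check.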

After estimating the parameters of the VAR models we can forecast the values
$d_{i,T+h}$ for a given forecast horizon h > 0 and a forecast origin T.

The optimal h-step forecast of the process (Lütkepohl 1991) is:

$$\hat{d}_{i,T+h} = \hat{v}_i + \hat{A}_{i1} \hat{d}_{i,T+h-1} + \dots + \hat{A}_{ip_i} \hat{d}_{i,T+h-p_i} \qquad (5)$$

The estimated global solution of [P1], when the forecasts $\hat{d}_{i,T+h}$
are given, is at hand. It is easily obtained by assigning object i to class
$C_{l^{*}}$ if $l^{*}$ is such that
$\hat{d}_{i l^{*}, T+h} = \min(\hat{d}_{il,T+h} : l = 1, \dots, c)$.
Notice that the estimated optimal partition is found without applying any
clustering algorithm, simply by solving an assignment problem with linear
computational time complexity O(nc).
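
The forecast recursion and the assignment rule can be sketched as follows, assuming that the intercept and the coefficient matrices have already been estimated. The function names and the numerical values are hypothetical and serve only as an illustration.

```python
def forecast(history, v, A, h):
    """h-step forecasts from a VAR(p) with intercept v and coefficient
    matrices A = [A_1, ..., A_p]; history must hold at least the last p
    observed vectors, most recent last."""
    past = [list(d) for d in history]
    out = []
    for _ in range(h):
        nxt = list(v)
        for m, Am in enumerate(A, start=1):
            lagged = past[-m]  # observed value or earlier forecast
            for r in range(len(v)):
                nxt[r] += sum(Am[r][s] * lagged[s] for s in range(len(lagged)))
        past.append(nxt)
        out.append(nxt)
    return out


def assign(dhat):
    """Index of the class whose forecast distance is smallest."""
    return min(range(len(dhat)), key=lambda l: dhat[l])


# Hypothetical VAR(1) for one object and c = 2 classes.
v = [1.0, 0.0]
A = [[[0.5, 0.0], [0.0, 1.0]]]
last_observed = [4.0, 3.5]           # distances to the two centroids at time T
steps = forecast([last_observed], v, A, h=2)
classes = [assign(d) for d in steps]
```

In this toy case the object is closest to the second centroid at time T but is forecast to be closest to the first one afterwards, so the forecast partition moves it to another class.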



3. An application to sea water pollution data 

Before partitioning the units of a panel data set, a three-way pre-processing
is necessary to carry out some basic transformations and return data
appropriate for clustering objects and forecasting partitions. Two
pre-processing steps are needed. From a cross-sectional point of view, a
column fiber standardization is required (Rizzi & Vichi, 1995) to identify
clusters not influenced by the units of measurement and the different
variability of the observed variables. From a time series point of view, the
n c-dimensional multiple time series are subject to seasonal effects and
trends, which have to be estimated and corrected with the usual methods.

In order to illustrate the proposed forecasting model, a panel data set on sea
water pollution was considered. It refers to 17 pollution control transects,
located at 500 meters from the coast, orthogonal to the mouths of the rivers,
along the Abruzzo coast. The water pollution was measured according to 10
variables, observed for a period of four years on a monthly basis (49 time
points).
The values for each station were first standardized so as to have zero mean
and unit variance.
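
A minimal sketch of this standardization is given below. Whether the standard deviation is computed with divisor n or n-1 is not stated in the text; the sketch assumes the population version (divisor n), and the function name and data are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale a series to zero mean and unit variance (divisor n).
    A constant series has zero standard deviation and is not handled."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]


z = standardize([2.0, 4.0, 6.0])
```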

Each cross-sectional data matrix was classified separately using the c-means
clustering algorithm (MacQueen 1967). The criterion proposed by Calinski and
Harabasz (1974), which Milligan and Cooper (1985) found to be one of the
best, was used to determine the number of classes in each data set. We found
that most of the classifications defined good partitions with between 5 and 7
classes and, to allow for comparison, the classification in 5 classes (the
mode) was considered. The initial partition at time t was given by the final
partition at time t-1. The LS estimates were computed for the 17 pollution
control transects. The order of the VAR model for each pollution control
transect, found by using the final prediction error (FPE) as summarized in
Table 1, is equal to one. The stationarity and normality conditions of the
VAR models were verified. Table 2 shows forecasts for months 50 and 51.
The 95% confidence intervals of the forecast distances were computed, and the
cluster membership was determined for the lower and upper confidence limits,
as in Table 3. At the 50th forecasted time, among the 17 pollution control
transects, 11 remain with an unchanged cluster membership and 6 report a
change in cluster membership for one of the two limits, while at the 51st
forecasted time, 10 remain with an unchanged cluster membership and 7 report
a change in cluster membership for one of the two limits.



Table 1: Estimation of the VAR order of the 5 clusters, for 17 pollution
control transects. The VAR models of orders p_i = 1, 2, 3, 4 are estimated and
the corresponding FPE (final prediction error) values are computed. The order
minimizing the FPE values is then chosen as the estimate of p_i.

Rows: VAR order p_i. Columns: pollution control transects 1 to 17.

p_i   1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17
1     0.01  0.01  0.00  0.01  0.01  0.02  0.07  0.02  0.20  0.02  0.02  0.02  0.02  0.01  0.01  0.01  0.02
2     0.03  1.19  2.36  4.14  0.03  0.07  0.04  0.04  0.46  0.04  0.06  0.06  0.04  0.03  0.03  0.03  0.07
3     0.08  3.32  4.31  6.24  0.31  0.12  0.07  0.05  1.06  0.14  0.13  0.24  0.13  0.05  0.06  0.07  0.15
4     0.24  1.32  1.18  2.12  0.01  0.10  0.16  0.18  2.28  0.52  0.41  0.57  0.41  0.11  0.15  0.18  0.53
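
The order selection summarized in Table 1 can be sketched as follows, using the usual form of the final prediction error for a c-dimensional VAR(m) fitted on T observations, $\mathrm{FPE}(m) = [(T + cm + 1)/(T - cm - 1)]^{c} \det \tilde{\Sigma}_u(m)$. The sketch assumes that the determinants of the residual covariance matrices are already available; the function names and the numbers are hypothetical and are not taken from the table.

```python
def fpe(det_sigma, T, c, m):
    """Final prediction error of a c-dimensional VAR(m) fitted on T
    observations, given the determinant of the residual covariance."""
    return ((T + c * m + 1) / (T - c * m - 1)) ** c * det_sigma


def select_order(dets, T, c):
    """dets maps each candidate order m to det of its residual covariance.
    Returns the order with the smallest FPE and all FPE values."""
    scores = {m: fpe(det, T, c, m) for m, det in dets.items()}
    return min(scores, key=scores.get), scores


# Hypothetical determinants for orders 1 and 2.
order, scores = select_order({1: 0.010, 2: 0.0095}, T=48, c=2)
```

The penalty factor grows with the order, so a higher order is selected only when it reduces the residual covariance enough to compensate.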



Table 2: Cluster membership for 17 pollution control transects: Forecasting for

Table 3a: 50th period: Cluster membership for 17 pollution control transects:
95% Confidence Intervals for upper and lower limit forecasts

Table 3b: 51st period: Cluster membership for 17 pollution control transects:
95% Confidence Intervals for upper and lower limit forecasts

The previous data set was characterized by a phenomenon which presents large
variability over time. A second example regards the performance of the seven
most industrialized countries (Canada, France, Germany, Japan, Italy, USA,
Great Britain) according to changes in some economic indicators (GDP,
Expenditure, Export, Import, and Industrial Production) in the period from
1970 to 1996. As can be expected for industrialized countries, all variables
have a strong trend with low variability. The pre-processing indicated in the
previous section was applied.

Table 4: Cluster membership for 7 most industrialized countries: Forecasting
for 1997 and 1998 (28th and 29th period).

forecast  Canada  France  Germany  Japan  Great Britain  Italy  USA
28        1       2       2        2      2              1      1
29        1       2       2        2      2              1      2

Table 5a: 28th and 29th period (results are equal for the two periods).
Cluster membership for 7 countries: 95% Confidence Intervals for upper and
lower limit

          Canada  France  Germany  Japan  Great Britain  Italy  USA
lower     1       2       2        2      2              1      1
upper     1       2       2        2      2              1      2

In this example the countries do not change cluster membership for the lower
and upper limit forecasts. This confirms that, when objects present a smooth
trend over time, forecasting the partition is simpler.



4. Discussion 

In this paper the problem of forecasting the partition of the units given a
set of partitions is discussed. A direct approach was proposed by Zani (1981):
the analysis starts from a panel data set, each of the n k-dimensional
multiple time series associated with the n units to be partitioned is
forecast, and a clustering algorithm, e.g. the c-means clustering algorithm,
is applied to the estimated values of the multiple time series. Here we
propose an indirect approach, consisting in forecasting the distances
$d_{ilt}$ between centers and units by means of vector autoregressive models
for each c-dimensional multiple time series of the distances $d_{ilt}$. When
such values are estimated, the forecasted classification, which is the global
optimal solution of [P1], is obtained by solving an easy assignment problem.
There are several reasons for preferring the indirect approach to the direct
one: i) in the indirect approach only VAR processes have to be estimated to
achieve a forecasted classification, whereas in the direct approach a
classification technique also has to be applied; ii) in economic and social
phenomena where the time series are smooth, the VAR models for each
c-dimensional multiple time series of the distances $d_{ilt}$ are generally
stationary of order 1; iii) in real problems, the number of variables is
generally much larger than the number of clusters, for reasons of synthesis,
thus making it simpler to solve forecasting problems in the indirect approach.
Abstract: The paper deals with the interpretation of a neural network in terms
of a semantic network able to describe a cluster of events. Some general
remarks on neural networks and semantic networks are proposed, and the
interpretation of semantic networks as clusters of events is considered as a
new way of understanding cluster analysis in a data base, provided we deal
with fuzzy events.

1. Introduction

The paper tries to relate four main methodological tools belonging to
different disciplines such as formal logic, artificial intelligence and
statistics. The methodological tools are fuzzy set theory, cluster analysis,
neural networks and semantic networks. Fuzzy set theory deals with the
generalization of the notion of membership of a given event $e$ in a set
$\mathcal{E}$ of events; cluster analysis deals with the partition $\Pi$ of a
set of events $\mathcal{E}$ in order to minimize/maximize the variability of
each subset of $\Pi$; neural networks deal with optimization problems; and
semantic networks deal with the representation of a set of events related by
the usual logic connectives $\wedge$ (and), $\vee$ (or), $\neg$ (not) and by
the material implication $\to$ (if ... then).

The relation among these tools becomes clear when we consider the general
target of identifying a subset of events characterized by a relationship of
mutual dependence, so that the occurrence of a given event is conditioned by
the occurrence of the other events following the scheme of a network, that
is, a graph. Moreover, each event is identified by vague attributes and its
occurrence can be described in terms of a qualitative ordering (for instance:
extremely low, low, high, extremely high and so on) as well as by a numerical
scale, provided the upper and the lower levels are defined. Neural networks
play a role as tools for identifying a subset whose main feature is the
presence of mutual dependence among its members. Finally, cluster analysis
comes in because the aim of the neural network is to build up the best
partition (covering) of the whole set of events, identifying the set of
mutual dependence relationships.

As a first example we can consider the set of variables occurring in a data
base. We have to build up a partition of the set of variables so that each
subset of variables is characterized by a set of linear regression equations.
Each equation identifies a statistical dependence, and we distinguish between
the set of independent variables and the dependent variable when we assume a
multiple regression equation. In other problems we build up subsets of units
whose main feature is their geometrical shape in the multidimensional
variable space. As another example, we can consider the spatial accessibility
among a subset of towns. A further example comes from propositional calculus,
where we expect each subset of propositions to be characterized by a set of
material implications such that each proposition implies all the remaining
propositions. Fuzzy logic, clustering algorithms, semantic networks and neural
networks are, generally speaking, not related, and the main aim of the paper
is to show that it is possible to define a common background able to relate
these tools in a single point of view leading to a common target. Some
consequences can be derived both from the theoretical point of view and from
the operational point of view. In this paper we explore both kinds of
consequences and we propose a new class of algorithms able to solve new
problems in all the fields mentioned: fuzzy set theory, neural networks,
semantic networks and cluster analysis.

A simple example showing the logical connections among the previous concepts
helps to understand the subsequent consequences.

Let us consider a data base on the books stocked in a big library. We would
like to build up an optimal partition of the books so that the books belonging
to each class are characterized by a set of attributes related to each other
by fuzzy implications. In other terms, each attribute is presented in a fuzzy
way and, for each pair of attributes, we can define a fuzzy implication able
to specify the highest level of fuzziness. Fuzziness means that each attribute
can occur at a given degree, and the implication can attain its maximum degree
provided both attributes exceed a given threshold. If the books are highly
devoted to statistical methodology and have a low technical degree, then the
implication means that we expect a class without advanced books in statistical
methodology.

We need to build up classes so that any attribute implies all the other attributes
defining the class. Finally, a neural network is a tool that embodies
the described logical structure and is able to find the optimal partition.

In the coming paragraphs we will show how to relate the four tools and we will
propose a new algorithm both for neural networks and for cluster analysis.

2. Clustering Algorithms

The first generalization regards the implication rules. Fuzzy implications can be
considered as a set of fuzzy relations whose membership functions give a different
meaning to the underlying concepts. Clusters of events can be considered a set
of events linked by implication rules represented by the arcs of a graph. As far as
the events can be described by propositions, we can consider a cluster of events
as a set of propositions related by a set of material implications and by the logic
connectives and, or, not, so that a function f can be optimized. The problem is to
give a weight to the implications, to choose the function f and to show the
isomorphism among the clustering algorithms, the semantic networks and the
neural networks.

Let us consider a set of events E and a graph G(E,W), where E is the set of
vertices representing the events and W is a subset of E x E representing the arcs
joining the vertices. Let us consider for each vertex a weight representing its
degree of membership in the set E. We can consider the fuzzy interpretation of the
vertices as fuzzy propositions and the fuzzy interpretation of the arcs as fuzzy
implications. We have two types of fuzzy implications, fuzzy material
implication and fuzzy propositional implication, where the operator $\varphi$ is a
fuzzy function:

$$\varphi(p \to q) = \min\{1,\ \varphi(p) + \varphi(q)\} \tag{1}$$

$$\varphi(p \Rightarrow q) = \max(0,\ \varphi(p) + \varphi(q) - 1) \tag{2}$$

It is easy to see that the fuzzy evaluation of the material implication and of the
propositional calculus is based on the fuzzy evaluation of the propositions p and
q (Kasabov 1996; Kosko 1992).
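
As a small illustration, the two evaluations can be sketched in Python. The sketch assumes that (1) is read as the bounded sum and (2) as the bounded product of the evaluations of p and q, which is how the text names them later; the function names are hypothetical and not part of the paper.

```python
def bounded_sum(phi_p: float, phi_q: float) -> float:
    """Evaluation (1), read as the bounded sum: min(1, phi_p + phi_q)."""
    return min(1.0, phi_p + phi_q)


def bounded_product(phi_p: float, phi_q: float) -> float:
    """Evaluation (2), read as the bounded product: max(0, phi_p + phi_q - 1)."""
    return max(0.0, phi_p + phi_q - 1.0)


if __name__ == "__main__":
    # Two propositions that both hold to a fairly high degree.
    print(bounded_sum(0.7, 0.6), bounded_product(0.7, 0.6))
    # Two propositions that both hold to a low degree.
    print(bounded_sum(0.2, 0.3), bounded_product(0.2, 0.3))
```

Under this reading the bounded sum saturates at 1 as soon as the two evaluations add up to at least 1, while the bounded product stays at 0 until they do.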

The main problem to face is the fuzzy evaluation of the vertices of the graph
representing the propositions, and the definition of a partition of the graph that
minimizes the fuzzy evaluation of the arcs relating the subsets corresponding to
semantic networks.

Clustering algorithms can be considered as a sequence of operations belonging
both to logic and to algebra, able to produce either a partition or a covering of
the set of the vertices of the graph representing a set of relations between
couples of vertices, so that a function f can be optimized under some
constraints. The operations, usually called modifiers, can be listed as follows:

i. $(\varphi(p))^{k}$, for k = u/v, u = 1,2,3,...,n and v = 1,2,...,n;   (3)

ii. $\varphi(p) = 1$ if $\varphi(p) > z$ and $\varphi(p) = 0$ otherwise, 0 < z < 1.   (4)

The implication rules (1) and (2) and the modifiers (3) and (4) are algebraic
operations which have some consequences on the arcs, like their weighting and
their cutting. More explicitly, modifier (3) is able to enlarge the value of
fuzziness attached to each attribute and therefore we can modify the value of
fuzziness of the implication rule in the cluster to which the attribute is
assigned. Modifier (4) is a threshold and is able to shift the fuzzy interpretation of
a table connecting attributes to units to a boolean interpretation.

It is obvious that modifier (4) reduces all the previous considerations on the
interpretation of the operation of fuzzy implication to the interpretation of
classical boolean implication.
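
The two modifiers can be sketched as follows. This is only an illustration under the assumption that modifier (3) raises a fuzzy evaluation in [0, 1] to the power k = u/v and that modifier (4) is a strict threshold at z; the function names are hypothetical.

```python
def power_modifier(phi: float, u: int, v: int) -> float:
    """Modifier (3): raise a fuzzy evaluation to the power k = u / v."""
    if not 0.0 <= phi <= 1.0:
        raise ValueError("a fuzzy evaluation must lie in [0, 1]")
    return phi ** (u / v)


def threshold_modifier(phi: float, z: float) -> float:
    """Modifier (4): 1 if the evaluation exceeds the threshold z, else 0."""
    if not 0.0 < z < 1.0:
        raise ValueError("the threshold z must satisfy 0 < z < 1")
    return 1.0 if phi > z else 0.0


if __name__ == "__main__":
    print(power_modifier(0.25, 1, 2))     # k = 1/2 enlarges the value
    print(power_modifier(0.5, 2, 1))      # k = 2 shrinks the value
    print(threshold_modifier(0.6, 0.5))   # above the threshold
```

For evaluations in [0, 1], an exponent k < 1 enlarges the value and an exponent k > 1 shrinks it, so the same modifier can push values towards 1 or towards 0 before the threshold is applied.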

We have to solve three problems : 

- the definition of cluster; 

- the choice of the function to be optimized; 

- the introduction of logical and algebraic constraints. 

In Bellacicco and Tulli (1996) a general definition of cluster is given in terms of
balanced graphs, which are characterized by the assignment of a constant
weight to all the arcs. This definition is able to identify many types of
graphs, including cliques and circuits, which seem the most reasonable shapes of
graphs representing clusters and satisfy some optimality criterion.

We generalize our previous definition of cluster by the introduction of fuzzy 
weights which are neither a distance nor a dissimilarity index. 

As a general framework we will use graph-theoretical language in order to
simplify and unify all the concepts.

First of all, a cluster can be represented by an oriented graph G(S,X), where S is
a set of vertices representing the units and X is the set of arcs, that is, the ordered
couples of vertices, each with a suitable weight which can be an integer, either
positive or negative, a rational number or a real number (Marshall 1971; Bellacicco
and Labella 1979). Following the previous definition, the arc connecting two
vertices can represent a logic implication and the vertices of the graph can
represent the attributes.

An interpretation of the cluster is the mapping of each cluster on the set of
books and more generally on the set of the units U:

f: G(S,X) -> U

The mapping f can be a homomorphism if f satisfies the obvious requirement
that the same frame on the attributes can be obtained in the units. Actually, the
same attribute can identify many units, and a cluster of attributes identifies a
cluster of units and vice versa (Bellacicco and Labella 1979). We show in the
mentioned book that in the boolean case the mapping is one-to-one and
therefore the mapping f can be an isomorphism.

We will generalize the mapping f to the fuzzy evaluation and therefore to the
fuzzy implication. The fuzzy evaluation of an implication by (1), bounded sum,
and by (2), bounded product, can be used as a weight of the corresponding arc in
the graph. We can introduce the triangular membership function and we can give
a fuzzy evaluation of the fuzzy implication. As is well known, the triangular
function is the following one:

$$\Phi(\varphi) = 1 - \frac{\varphi - \alpha}{\beta} \tag{5}$$

where $\alpha$ is the minimum evaluation of $\varphi$ and $\beta$ is the range of the evaluation $\varphi$.
It is easy to see that (5) reduces to the evaluation of the complement of the
fuzzy implication when $\beta = 1$, which is the length of the range of a fuzzy
evaluation. Modifier (4) can reduce in any case a fuzzy evaluation to a fuzzy
implication and we expect that the modifier can operate without changing the
interpretation. Actually, we can obtain a boolean interpretation for every value of
fuzziness and therefore the mapping f is preserved.

A clustering algorithm can be reduced to the sequential use of the modifier (3)
for u = n and v = 1, and a thresholding operation (4), where the threshold z can be
chosen in case we like to delete the lowest percentile of arcs.

The function to be minimized is:

$$f = \sum_{i} (1 - \varphi_{i/r}) \tag{6}$$

where i/r means: proposition $p_i$ in the cluster $C_r$.

It is easy to see that each cluster is characterized in terms of minimum
fuzziness. In other terms, we would like to maximize the distinction between
couples of clusters. We can now analyse the semantic network interpretation of
the previous clustering approach and its neural network representation.



3. Some Isomorphisms 

In the previous paragraph we gave a general outline of a clustering algorithm in
terms of logical operations on a graph G whose vertices are fuzzy propositions
representing events and whose arcs are fuzzy implications weighted by their
fuzziness evaluated by (1) or by (2). The choice of the weight depends on the
type of optimization: in case we prefer to minimize the fuzziness and we
adopt a triangular function, we can adopt equation (1), which pushes the
evaluation near the value 1, that is, the absence of fuzziness. We must spend some
words about the evaluation of the fuzziness of the elements of the table $T_{M,N}$
and therefore of the vertices of the graph obtained by the introduction of a non
symmetric weighting of the arcs. We can follow the subsequent transformations:

Step 1. Normalize each element $x_{ij}$ by the ratio $z_{ij} = x_{ij}/\max(j)$, where by
max(j) we mean the maximum value in the column j;

Step 2. Take the average of the elements of each row $S_i$ of the transformed data
$z_{ij}$, denoted $\bar{z}_i$;

Step 3. Adopt the modifier (3), for k = 1/64, on the values $z_{ij} > \bar{z}_i$;

Step 4. Adopt the modifier (3), for k = 5, on the values $z_{ij} < \bar{z}_i$;

Step 5. Use two thresholds in order to obtain a crisp set of binary data, $c_{ij}$;

Step 6. Consider each couple of S in $T_{n,m}$ and introduce an ordering
relationship based on the highest frequency of true material implications of
the type:

$c_{ij} \to c_{kj}$ and $c_{kj} \to c_{ij}$, in the set of the m variables.

Step 7. Evaluate the fuzzy implication by (1) considering the data $z_{ij}$.
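
Steps 1 to 4 can be sketched in Python as follows. The sketch rests on assumptions that the text does not state explicitly: the comparison value in Steps 3 and 4 is taken to be the row average computed in Step 2, values exactly equal to the row average are treated like the lower ones, the data are non-negative with a positive maximum in every column, and the function name is hypothetical. Steps 5 to 7 are not covered.

```python
def fuzzify_table(x, k_high=1 / 64, k_low=5):
    """Apply Steps 1-4 to a table x given as a list of rows of numbers.

    Returns the normalised table z and the table after the power modifier.
    """
    n_cols = len(x[0])
    # Step 1: divide every element by the maximum of its column.
    col_max = [max(row[j] for row in x) for j in range(n_cols)]
    z = [[row[j] / col_max[j] for j in range(n_cols)] for row in x]

    modified = []
    for row in z:
        # Step 2: average of the normalised elements of the row.
        row_mean = sum(row) / n_cols
        # Step 3: exponent k_high (< 1) enlarges values above the row average.
        # Step 4: exponent k_low (> 1) shrinks the remaining values.
        modified.append(
            [v ** k_high if v > row_mean else v ** k_low for v in row]
        )
    return z, modified


if __name__ == "__main__":
    z, modified = fuzzify_table([[2, 1], [4, 4]])
    print(z)
    print(modified)
```

With these exponents, values above the row average are pushed close to 1 and the others close to 0, which prepares the data for the thresholds of Step 5.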

We can now state that a clustering algorithm based on the previous steps on an
oriented graph is actually an algorithm for detecting a set of semantic
networks. Some final remarks regard the isomorphism between a neural network
NN and a semantic network SN. We can state some relations between neural
networks and semantic networks provided a learning rule can be considered as a
dynamical updating of the weights of the implications in order to accomplish
the task. Semantic networks are a static device for describing a set of
events. We do not consider in this paper the role of the learning rule and we
limit the analysis to the static behaviour of the two logical structures,
introducing the following statements:

Statement 1. A semantic network M has the same computational power as a
neural network N with three layers;

Statement 2. For a set of well formed formulas and some true propositions, there
exists a neural network N with three layers with a suitable set of weights on
the arcs;

Statement 3. There exists a universal neural network for every set of propositions as
inputs and for any set of well formed formulas.

We cannot give in this paper a demonstration of the three previous statements
without introducing a specific formalization of the arguments, which is beyond
the aim of the paper. Neural networks are input-output systems able to produce
an input-output transformation, which must be either close to a template or
sharing some main features of the input. The last case characterizes the
neural networks with three layers, where the hidden layer is the set of inferential
rules, which can be both the modus ponens and the modus tollens (Freeman and
Skapura 1992). Those rules are transformations of the input, and the input is the
set of the propositions describing the events represented by the units of the $T_{n,m}$
table. The usual non linear transformation of the data is interpreted in terms of
fuzzy sets theory as the mentioned inferential rules. Modus ponens is defined as
follows: if ( Si is p ) then ( Sj is f = 1-p ), and Si is w ) then Sj is i = i - i. In terms of
fuzzy sets the previous inference rule implies a transformation of the fuzziness
able to confirm the consequence both for the fuzzy evaluation and for its
complement evaluation. We can easily see that the implication is modified from
rule (1) and from rule (2), as we mentioned before. Modus tollens is just the
reverse of the previous statement and in some way it is not a new inferential
rule. Non linearity occurs because we can shift from a fuzzy evaluation to its
complement, and the previous rule can be considered like a step function in a
unitary square which can be approximated by the logistic function; both
functions are usually adopted in neural networks.

Abstract: This work describes the hierarchical classification procedure called
‘fuzzy average linkage’ which provides a fuzzy partition of a group of units. The
basic principle is that the average similarity of units linked to the same group
must be greater than or equal to a certain pre-set similarity level. This method is
applied to mortality rates by cause of death for men and women in the 1970s,
1980s and 1990s.

1. Method

The method proposed here, called ‘fuzzy average linkage’, is a fuzzy
hierarchical classification method producing partitions of a set of units to
be classified. In order to clarify what we mean by fuzzy partition, we shall use a
scheme (Ricolfi 1992) in Table 1 which enables us to distinguish between the
clustering methods based on the results they give, i.e., based on the membership
function. This is, for the i-th unit, a multivariate function with G values $p_{ig}$
(g=1,...,G), where G is the number of classes of the partition or clumping, and
$p_{ig}$ is called the grade of membership of object i to group g. If all the grades of
membership range between 0 and 1, we have a fuzzy classification; when these
values are strictly 0 or 1, the classification degenerates into a hard or crisp one.

Table 1: Classification of clustering methods based on properties of the membership function.

                                Range of definition
Sum of grades of membership     Hard methods: {0,1}    Fuzzy methods: [0,1]
Partition: Σ_g p_ig = 1         Hard partition         Fuzzy partition
Clumping:  Σ_g p_ig > 1         Hard clumping          Fuzzy clumping

When the sum of the grades of membership of the i-th object to all clusters is exactly 1,
the classification is a partition; when this sum is greater than 1, the
classification is a clumping.
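
The scheme of Table 1 can be illustrated with a short Python sketch that inspects a matrix of grades of membership. It is only a sketch: the function name is hypothetical, a numerical tolerance is used for the comparisons, and a clumping is assumed to mean that no row sums to less than 1 and at least one row sums to more than 1.

```python
def classify_membership(p, tol=1e-9):
    """Classify a membership matrix p (rows = units, columns = groups).

    Returns a pair (sharpness, kind) with sharpness in {"hard", "fuzzy"}
    and kind in {"partition", "clumping", "neither"}.
    """
    grades = [v for row in p for v in row]
    if any(v < -tol or v > 1 + tol for v in grades):
        raise ValueError("grades of membership must lie in [0, 1]")

    # Hard methods only use the values 0 and 1.
    is_hard = all(abs(v) <= tol or abs(v - 1) <= tol for v in grades)
    sharpness = "hard" if is_hard else "fuzzy"

    sums = [sum(row) for row in p]
    if all(abs(s - 1) <= tol for s in sums):
        kind = "partition"
    elif all(s >= 1 - tol for s in sums) and any(s > 1 + tol for s in sums):
        kind = "clumping"   # assumption: no sum below 1, at least one above
    else:
        kind = "neither"
    return sharpness, kind


if __name__ == "__main__":
    print(classify_membership([[0.52, 0.48], [1, 0]]))
    print(classify_membership([[1, 1], [0, 1]]))
```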

The similarity index between objects i and j used in this paper is the Gower
measure (Gower 1971):

$$s_{ij} = \frac{\sum_{k=1}^{K} p(k)\, v_{ij}(k)}{\sum_{k=1}^{K} p(k)} \tag{1}$$

where K is the overall number of variables, p(k) is the non-negative weight
associated with the k-th variable, and $v_{ij}(k)$ is one minus the relative distance (i.e.
the distance compared to the maximum distance calculated for that item)
between pairs of units. The fuzzy methodology proposed here uses centroids of
the variables and therefore only quantitative and ordinal variables can be used.
The distance used here is the generalised Hamming distance (Ponsard 1985).
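
A minimal Python sketch of such a similarity matrix for quantitative variables is given below. It assumes the usual weighted-average form of Gower's coefficient, takes the absolute difference as the distance on each variable, and takes the range of each variable over all units as the maximum distance; these choices and the function name are assumptions made for illustration, not the authors' implementation.

```python
def gower_similarity(data, weights=None):
    """Return the matrix of Gower similarities between the rows of data.

    data    : list of units, each a list of K quantitative values
    weights : optional list of K non-negative weights p(k)
    """
    n, n_vars = len(data), len(data[0])
    if weights is None:
        weights = [1.0] * n_vars
    total_weight = sum(weights)

    # Maximum distance on each variable, taken here as its range.
    ranges = [
        max(unit[k] for unit in data) - min(unit[k] for unit in data)
        for k in range(n_vars)
    ]

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n_vars):
                diff = abs(data[i][k] - data[j][k])
                relative = diff / ranges[k] if ranges[k] > 0 else 0.0
                acc += weights[k] * (1.0 - relative)   # p(k) * v_ij(k)
            sim[i][j] = acc / total_weight
    return sim


if __name__ == "__main__":
    for row in gower_similarity([[0, 10], [5, 10], [10, 30]]):
        print(row)
```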



2. The classification procedure 

Let S be a similarity matrix to which the following classification procedure is
applied.

Step 1: A search is made for the pairs of units with the maximum value of the
similarity index (for some matrices S, there may be more than one pair). These
pairs form the initial groups. The centroids for each group are calculated, as well
as the values $p_{ig}$ for each unit, which will be inversely proportional to the
distance of the unit from the group centroid (the further away the i-th unit is
from the g-th centroid, the smaller the value of $p_{ig}$ will be). Note that if
the unit belongs to one group only, its membership function is equal to 1 only
for this group and is equal to 0 otherwise, to satisfy the necessary condition for
having a fuzzy partition.

Step 2: Excluding the values already taken into consideration, a further search
in S is made for pairs of units with the maximum value of similarity. Let α be
this maximum. The pairs identified may be composed of units which have
already been considered in the previous step. For the sake of simplicity, we can
assume that we have identified pair (i', j') and that unit i' has already been
inserted in a group. In this case, j' is inserted in the group to which i' belongs
only if the average similarity between j' and all the units composing the group
is greater than α. Otherwise, units i' and j' will form a new group. Once groups
have been formed, the centroids and membership function are calculated as
specified in step 1.

Step 3: Step 2 is repeated until all the units form a single group.
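
Two ingredients of the procedure can be sketched in Python: the average-similarity test of Step 2 and grades of membership that are inversely proportional to the distance from the group centroids. The sketch assumes that the grades of a unit are normalised to sum to 1 over the groups it belongs to and that a unit lying exactly on a centroid is assigned to that group; the function names and the example values are hypothetical.

```python
def joins_group(sim, group, candidate, alpha):
    """Step 2 test: does the candidate unit join the existing group?

    sim       : similarity matrix (list of lists)
    group     : indices of the units already in the group
    candidate : index of the unit to be inserted
    alpha     : current similarity level
    """
    average = sum(sim[candidate][u] for u in group) / len(group)
    return average > alpha


def memberships(distances):
    """Grades of membership of one unit from its centroid distances.

    distances maps each group the unit belongs to onto the distance of the
    unit from that group's centroid.
    """
    if len(distances) == 1:
        return {g: 1.0 for g in distances}        # one group only: grade 1
    on_centroid = [g for g, d in distances.items() if d == 0]
    if on_centroid:                                # avoid division by zero
        share = 1.0 / len(on_centroid)
        return {g: (share if g in on_centroid else 0.0) for g in distances}
    inverse = {g: 1.0 / d for g, d in distances.items()}
    total = sum(inverse.values())
    return {g: w / total for g, w in inverse.items()}


if __name__ == "__main__":
    sim = [[1.0, 0.9, 0.8], [0.9, 1.0, 0.7], [0.8, 0.7, 1.0]]
    print(joins_group(sim, [0, 1], 2, alpha=0.7))
    print(memberships({"g1": 1.0, "g2": 3.0}))
```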

3. An application to adult mortality

The fuzzy average linkage technique has been applied to the study of mortality
rates by cause of death in some European countries (Table 2). Standardised
female and male mortality rates by cause of death (Table 3) have been used for
the 30-64 age group for the 1970s, 1980s and 1990s.

Austria Denmark Greece Luxemburg Sweden 

Belgium Finland Hungary Netherlands Switzerland 

Bulgaria France Ireland Norway United Kingdom 



Table 3. Causes of Death

Cancer:
    Stomach cancer
    Colon and rectum cancer
    Tracheal, bronchial and lung cancer
    Other cancers
Diabetes
Diseases of the circulatory system:
    Ischemic heart diseases
    Cardiovascular diseases
    Other diseases of the circulatory system
Diseases of the respiratory system:
    Pneumonia
    Bronchitis
Diseases of the digestive system:
    Hepatic cirrhosis
    Other diseases of the digestive system
Ill-defined causes and senility
External causes:
    Accidents
    Suicide
    Other external causes
All other causes



Most traditional cluster analysis methods combine similar geographical areas 
into a single group, so that the group indicates a specific mortality profile and 
each area may belong to a single group even if it is somewhat similar to the 
geographical areas of other groups. The traditional classification implicitly 
assumes that there are no similarities in the mortality models of bordering 
geographical areas, while it seems reasonable to assume that neighbouring 
population groups are subject to causes of death related to similar types of 
individual behaviour or characteristics of the environment where they live. A 
“fuzzy” classification method enables us not only to identify groups of 
European countries having mortality profiles which are similar for most of the 
causes of death considered, but also to specify the degree of similarity linking 
geographical areas belonging to the same group. Since this method enables a 
country to be assigned to more than one group, we obtain a panorama of the 
geography of mortality in which the similar groups are not separated by rigid 
borders, but which can also overlap.

Since the application cannot be fully described in this paper, we will provide a
summary of the main results. For example, we can show the result obtained for
data referring to women in 1990.

We have to choose the level of similarity at which the groups obtained should be
analysed, and this choice represents one of the most debated problems
concerning hierarchical methods of clustering. Different rules can be followed,
but in general, in order to ensure sufficient internal homogeneity in the group,
rather high levels should be chosen. The aggregative procedure for fuzzy
average linkage produces a number of groups which, being fuzzy, can be
different for each level. Therefore, we suggest choosing one of the highest
levels, since it is more probable that a smaller number of groups will be formed
and less fuzziness will be found. This does not mean that we cannot use a less
subjective technique of choosing, such as the ones proposed for hard clustering.
We have therefore considered the similarity level α = 0.82, where α indicates the
maximum threshold allowed for average similarity between two units
belonging to the same group; at this level the 4 groups shown in Table 4 were formed.

In order to provide the best interpretation for this result, we can calculate the 
averages - weighted with the membership function - of the variables for 
countries belonging to the same group (Table 5). We obtain the profile of a unit 
representing that group. For example, group 1, consisting of Ireland and United 
Kingdom only, has rather high rates of mortality due to cancer and circulatory 
diseases. In the other three groups, the differences are less evident since they 
are fuzzier, showing that while some differences remained in the 1990s, 
Northern and Southern Europe have reached comparable mortality rates for 
almost all causes of death. 
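
The membership-weighted averages that give the profile of a group can be sketched as follows. The function name is hypothetical and the numbers in the example are purely illustrative; they are not taken from the data of the paper.

```python
def group_profile(values, grades):
    """Membership-weighted average of each variable for one group.

    values : list of units, each a list of K variable values
    grades : grade of membership of each unit in the group
    """
    total = sum(grades)
    if total == 0:
        raise ValueError("no unit belongs to this group")
    n_vars = len(values[0])
    return [
        sum(g * unit[k] for g, unit in zip(grades, values)) / total
        for k in range(n_vars)
    ]


if __name__ == "__main__":
    # Three units, two variables; the third unit does not belong to the group.
    rates = [[10.0, 2.0], [20.0, 4.0], [99.0, 99.0]]
    print(group_profile(rates, [1.0, 0.5, 0.0]))
```

A unit with grade 0 does not influence the profile, while a unit shared between groups contributes to each of them in proportion to its grade.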

To conclude, we can briefly mention the result of the entire application.
Countries showing a nil linkage value are the isolated ones. As we can see, the
Eastern European countries and Denmark are isolated. This happens because
their profiles are not similar enough to those of the other European
countries considered (i.e. their similarity does not exceed 0.82).

The classifications obtained for the various years of observation range from a 
situation in which there are groups with little overlapping, thus indicating a 
certain degree of dissimilarity, to a situation in which there are few groups and 
a certain overlapping with high degrees of similarity. Some European countries 
remain exceptions with respect to the others. This applies to Ireland and the 
United Kingdom, almost always related, but especially Ireland. This also 
applies to some Eastern European countries, since this region is behind with 
respect to the changes occurring in the 20-year period (1970-1990). In South- 
Central Europe there is a gradual move towards “northern” mortality profiles, 
which explains the greater degree of standardisation among the European 
countries in 1990. This is due to the fall in mortality rate from diseases due to
poor sanitary and health conditions, together with the rise in mortality rate due
to other types of diseases linked to the lifestyles of the more highly developed
countries (Caselli 1993).



Table 4: Fuzzy classification for women in 1990 at level α = 0.82

                       Groups
European countries     g1      g2      g3      g4
Austria                0       0.52    0       0.48
Belgium                0       0.33    0.38    0.30
Bulgaria               0       0       0       0
Czechoslovakia         0       0       0       0
Denmark                0       0       0       0
Finland                0       0       1       0
France                 0       0.51    0.49    0
West Germany           0       0.44    0.56    0
Greece                 0       0       0       1
Hungary                0       0       0       0
Ireland                1       0       0       0
Italy                  0       0.48    0       0.52
Luxembourg             0       0       1       0
Netherlands            0       0.53    0       0.47
Norway                 0       0.56    0       0.44
Spain                  0       0.47    0       0.53
Sweden                 0       0.33    0.31    0.36
Switzerland            0       0.34    0.27    0.39
United Kingdom         1       0       0       0
Yugoslavia             0       0       0       0

Table 5: Averages of the variables, weighted with the membership function, for each group

                                            Groups
Variables                                   g1        g2        g3        g4
Stomach cancer                              4.77      5.31      4.94      5.44
Colon and rectum cancer                     15.23     12.43     11.76     10.88
Tracheal, bronchial and lung cancer         25.58     11.69     8.00      11.52
Other cancers                               121.09    100.95    105.01    95.87
Diabetes                                    4.14      4.69      3.44      4.37
Ischemic heart diseases                     52.72     21.56     23.23     21.58
Cardiovascular diseases                     20.33     14.12     18.17     15.71
Other diseases of the circulatory system    17.80     18.96     20.03     20.21
Pneumonia                                   4.56      1.90      2.38      1.74
Bronchitis                                  5.89      3.58      3.82      2.92
Hepatic cirrhosis                           4.69      10.61     14.16     7.85
Other diseases of the digestive system      6.68      5.03      6.16      4.44
Ill-defined causes and senility             0.74      5.82      6.96      5.57
Accidents                                   1.39      1.55      1.82      1.50
Suicide                                     6.13      11.29     15.18     8.84
Other external causes                       11.17     12.36     15.91     12.19
All other causes                            34.56     21.42     24.22     19.74

Abstract: This paper presents a new algorithm for semi-fuzzy clustering that
allows objects to belong not necessarily to all the clusters, but also to only one
of them. The advantage of this new method is that fuzziness is not introduced
for all objects but only for those that cannot be classified as belonging to a
single cluster. The performance of the new algorithm compared to the fuzzy
c-means algorithm is shown by an application to a data set.

Key words: Fuzzy c-means algorithm, semi-fuzzy classification, soft 
clustering. 



1. Introduction 

In order to solve a clustering problem, it is necessary to distinguish between
two types of data clustering: traditional "hard" clustering, in which objects are
allowed to belong to only one cluster, and "fuzzy" clustering, in which each
object belongs to all available clusters with different grades of membership.
During the last few years many fuzzy clustering methods have been developed
because the nature of most practical problems is “fuzzy”, that is, so complex and
diversified that it can hardly be interpreted using hard clustering.

The problem with many fuzzy clustering algorithms is that they often give 
excessively fuzzy classifications which are difficult to interpret. This is mainly 
due to the fact that each object must belong to all the clusters. 

In order to solve this problem, some “semi-fuzzy” clustering methods have 
recently been proposed which allow objects to belong to several clusters with 
various grades of membership, but not necessarily to all clusters (Selim and 
Ismail 1984 and Kamel and Selim 1991). 

In the following paragraph a new semi-fuzzy clustering algorithm is proposed 
which, by modifying the fuzzy c-means method (Bezdek 1981), appears to 
provide better results without altering its main properties. 

Section 3 gives an example in which the performances of the two methods are 
compared. 



2. The semi-fuzzy c-means method 

One of the most well-known fuzzy clustering algorithms is the fuzzy c-means
(FCM) method. It is appreciated for some properties (Bezdek and Hathaway
1988) and is particularly suitable for applications on large data sets (Iacovacci
1995).

One of the chief defects of the FCM method lies in the fact that, given the
number c of clusters in which n objects must be classified, it is necessary to
determine each object's grade of membership of each of the c clusters so that it
is greater than zero. It is thus not permissible for the relationship between an
object and one or more clusters to be nil, except in the (very rare) case in which
an object coincides with the center of a cluster, as in this case it is assigned to
that cluster only.

The chief consequence of this characteristic of the FCM method is that, at 
times, it produces classifications that are excessively fuzzy, i.e. whose overall 
fuzziness is far greater than that existing in reality. 

To reduce this disadvantage, a semi-fuzzy c-means (SFCM) method is proposed 
that allows not only the objects coinciding with the center of a cluster, but also 
other objects having given characteristics to be assigned entirely to a single 
cluster. 

This method proposes that object i be assigned totally to cluster k if, taking $d_{ik}$
to indicate the distance between them, $d_{ik} < (1/a)\, d_{rk}$, where a is a parameter
(> 1) determined a priori, and $d_{rk}$ indicates the distance between the center of the
k-th cluster and the center of the r-th cluster nearest to it.

This formulation meets the intuitive need to classify as totally belonging to a 
cluster any object whose distance from the center of the said cluster is not only 
reasonably small, but which is at the same time sufficiently far away from all 
the other clusters. 

Accordingly, the SFCM method determines for each cluster a zone of 
gravitational attraction (determined as a function of the other clusters) and 
assigns all the objects located inside this zone to the said cluster. 

Let n be the total number of objects, c be the number of clusters and $\mu_{ik}$ be the
grades of membership of object i to cluster k. The final classification is
obtained using the same iterative process followed by the FCM method, with a
further constraint. It calculates the optimal centers of the c clusters and the
optimal values of the grades of membership that minimize the following
function:

$$J_m(U,V) = \sum_{k=1}^{c} \sum_{i=1}^{n} \mu_{ik}^{m}\, \|x_i - v_k\|^2 \tag{1}$$

subject to:

$$\mu_{ik} \in [0,1], \qquad i=1,\dots,n; \quad k=1,\dots,c \tag{2}$$

$$\sum_{k=1}^{c} \mu_{ik} = 1, \qquad i=1,\dots,n \tag{3}$$

$$\mu_{ik} = 1 \quad \text{if} \quad \|x_i - v_k\| < \frac{1}{a}\,\|v_r - v_k\|, \qquad r,k=1,\dots,c \tag{4}$$

where $\|\cdot\|$ is an appropriate norm on $R^p$; $x_i \in R^p$ is the vector of the i-th
object; $v_k \in R^p$ is the vector of the center of the k-th cluster and $v_r \in R^p$ is the
vector of the center of the r-th cluster; m is a scalar, m > 1; a is a scalar, a > 1;
$U = \{\mu_{ik}\}$ is the matrix (n x c) of the grades of membership; $V = [v_1,\dots,v_c]$ is the
matrix (p x c) of the c centers of the clusters.

The parameter m that appears in (1) is particularly important, since with m=1
we obtain a hard classification (and the FCM algorithm coincides with the
traditional c-means algorithm), whereas with m > 1 a classification is obtained
whose fuzziness increases as the value of m rises.

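The algorithm that describes the SFCM method is as follows:

Step 1: choose the value of m, a, and a small positive scalar $\delta$. Select an
arbitrary membership matrix $U = \{\mu_{ik}\}$.

Step 2: calculate cluster centers using the following formula:

$$v_k = \frac{\sum_{i=1}^{n} \mu_{ik}^{m}\, x_i}{\sum_{i=1}^{n} \mu_{ik}^{m}}, \qquad k=1,\dots,c$$

Step 3: compute the distance matrix $D_1 = \{d_{ik}\}$, where $d_{ik} = \|x_i - v_k\|$ (i=1,...,n;
k=1,...,c), and the distance matrix $D_2 = \{d_{rk}\}$, where $d_{rk} = \|v_r - v_k\|$ (r=1,...,c;
k=1,...,c).

Step 4: update the membership matrix U computing the matrix $\hat{U} = \{\hat{\mu}_{ik}\}$:
for each i=1,...,n do
   if $d_{ik} = 0$ then $\hat{\mu}_{ik} = 1$ and $\hat{\mu}_{ir} = 0$ for all r ≠ k
   else if $d_{ik} < (1/a)\, d_{rk}$ for some r, with k ≠ r, then $\hat{\mu}_{ik} = 1$ and $\hat{\mu}_{ir} = 0$ for all r ≠ k
   else compute $\hat{\mu}_{ik} = \left[ \sum_{r=1}^{c} (d_{ik}/d_{ir})^{2/(m-1)} \right]^{-1}$, k=1,...,c

Step 5: if $\|\hat{U} - U\| < \delta$ then stop; else set $U = \hat{U}$ and go to step 2.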
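
The five steps can be sketched in pure Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the norm is taken to be Euclidean, the fuzzy update is the standard FCM membership formula, the zone test uses the center nearest to cluster k (as in the verbal description of the method), convergence is measured by the largest absolute change in U, and the names `sfcm`, `U0`, `max_iter` and `seed` are hypothetical.

```python
import math
import random


def sfcm(points, c, m=2.0, a=3.0, delta=1e-3, U0=None, max_iter=100, seed=0):
    """Semi-fuzzy c-means sketch; returns (U, V). Requires c >= 2, m > 1, a > 1."""
    if c < 2 or m <= 1 or a <= 1:
        raise ValueError("need c >= 2, m > 1 and a > 1")
    n, p = len(points), len(points[0])

    # Step 1: arbitrary membership matrix whose rows sum to 1.
    if U0 is None:
        rng = random.Random(seed)
        U = []
        for _ in range(n):
            row = [rng.random() + 1e-6 for _ in range(c)]
            total = sum(row)
            U.append([v / total for v in row])
    else:
        U = [list(row) for row in U0]

    V = [list(points[k % n]) for k in range(c)]   # placeholder, replaced in Step 2
    for _ in range(max_iter):
        # Step 2: cluster centers as weighted means with weights mu_ik ** m.
        for k in range(c):
            w = [U[i][k] ** m for i in range(n)]
            tw = sum(w)
            if tw > 0:   # keep the previous center if the cluster is empty
                V[k] = [sum(w[i] * points[i][d] for i in range(n)) / tw
                        for d in range(p)]

        # Step 3: object-to-center and center-to-center distances.
        D1 = [[math.dist(x, v) for v in V] for x in points]
        D2 = [[math.dist(vr, vk) for vk in V] for vr in V]

        # Step 4: update the memberships.
        new_U = []
        exponent = 2.0 / (m - 1.0)
        for i in range(n):
            row = None
            for k in range(c):
                nearest = min(D2[r][k] for r in range(c) if r != k)
                # Hard assignment: object on the center or inside its zone.
                if D1[i][k] == 0 or D1[i][k] < nearest / a:
                    row = [1.0 if q == k else 0.0 for q in range(c)]
                    break
            if row is None:   # usual FCM update for the remaining objects
                row = [1.0 / sum((D1[i][k] / D1[i][r]) ** exponent
                                 for r in range(c)) for k in range(c)]
            new_U.append(row)

        # Step 5: stop when the memberships no longer change appreciably.
        change = max(abs(new_U[i][k] - U[i][k])
                     for i in range(n) for k in range(c))
        U = new_U
        if change < delta:
            break
    return U, V


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (5.5, 5.5),
            (10, 10), (10, 11), (11, 10), (11, 11)]
    start = [[0.9, 0.1]] * 4 + [[0.5, 0.5]] + [[0.1, 0.9]] * 4
    U, V = sfcm(data, c=2, m=2.0, a=3.0, U0=start)
    for point, row in zip(data, U):
        print(point, [round(v, 3) for v in row])
```

In this toy example the eight objects near the two centers fall inside a zone of attraction and receive a grade of exactly 1, while only the intermediate object keeps fuzzy grades.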

The main advantage of the SFCM algorithm is that it produces a mix between a
“hard” classification and a “fuzzy” classification, thus introducing fuzziness
not for all objects, but only for those in which fuzziness is actually found.

The SFCM method and the FCM method are compared in the following
example.

3. Experimental results

To compare the performance of the FCM algorithm with that of the SFCM algorithm,
both algorithms have been applied to the data set composed of 22 objects
shown in Figure 1 below.

Figure 1: Example data set.

It can be seen from Figure 1 that the objects of the data set are distributed in three
separate and distinct clusters, except for objects 6 and 13, which are located in
an intermediate position in relation to these clusters (note the position of object
13 in particular). The groups are: cluster 1: objects 1-5; cluster 2: objects 7-12;
cluster 3: objects 14-22.

These three clusters are clearly identified by applying to the data the FCM
algorithm with c=3 and m=1. Choosing m=1, we obtain the hard classification
reported in Table 1.



Table 1. 


Result of FCM algorithm (d=0. 


001, c=3. 


m=l). 










Grades of membership 






grades of membership 




Object 








Object 




^,7 




1 


1.00 


0.00 


0.00 


12 


0.00 


1.00 


0.00 


2 


1.00 


0.00 


0.00 


13 


1.00 


0.00 


0.00 


3 


1.00 


0.00 


0.00 


14 


0.00 


0.00 


1.00 


4 


1.00 


0.00 


0.00 


15 


0.00 


0.00 


1.00 


5 


1.00 


0.00 


0.00 


16 


0.00 


0.00 


1.00 


6 


1.00 


0.00 


0.00 


17 


0.00 


0.00 


1.00 


7 


0.00 


1.00 


0.00 


18 


0.00 


0.00 


1.00 


8 


0.00 


1.00 


0.00 


19 


0.00 


0.00 


1.00 


9 


0.00 


1.00 


0.00 


20 


0.00 


0.00 


1.00 


10 


0.00 


1.00 


0.00 


21 


0.00 


0.00 


1.00 


11 


0.00 


1.00 


0.00 


22 


0.00 


0.00 


1.00 
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We just need to look at the classification in Table 1 to see that hard clustering is 
unsuitable to describe the fuzzy situation of objects 6 and 13 which are both 
assigned to cluster 1. 

Applying the FCM method with various values of m > 1, it is with m = 2.7 that
we obtain the classification that seems best able to represent the situation:
objects 6 and 13 are assigned to the three clusters to different extents,
depending on how near they are to each of them, while the remaining objects
have a high grade of membership in their own clusters, as they are located very
near to the cluster centers. The classification is reported in Table 2.
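
To make the role of the exponent m concrete, the alternating update behind a
fuzzy c-means run can be sketched as follows. This is only a minimal
illustration under stated assumptions, not the implementation used for the
tables: the function name `fcm`, the toy coordinates (the 22 objects of
Figure 1 are not reproduced here), the random initial matrix U and the stopping
rule on the largest change in membership are all hypothetical choices. The
sketch requires m > 1; the hard case m = 1 corresponds to ordinary c-means.

```python
import math
import random


def fcm(points, c, m=2.0, delta=1e-3, max_iter=300, seed=0):
    """Fuzzy c-means sketch: returns (membership matrix U, cluster centers)."""
    if m <= 1.0:
        raise ValueError("this sketch needs m > 1")
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial membership matrix U, each row normalized to sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        U.append([v / total for v in row])
    centers = []
    expo = 2.0 / (m - 1.0)
    for _ in range(max_iter):
        # Centers: means of the points weighted by u_ig ** m.
        centers = []
        for g in range(c):
            w = [U[i][g] ** m for i in range(n)]
            sw = sum(w)
            centers.append(
                [sum(w[i] * points[i][d] for i in range(n)) / sw for d in range(dim)]
            )
        # Memberships: inverse relative distances to the centers.
        new_U = []
        for x in points:
            dist = [math.dist(x, v) for v in centers]
            if min(dist) == 0.0:
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dist]
                new_U.append([h / sum(hits) for h in hits])
                continue
            new_U.append(
                [1.0 / sum((dist[g] / dist[k]) ** expo for k in range(c))
                 for g in range(c)]
            )
        change = max(abs(new_U[i][g] - U[i][g]) for i in range(n) for g in range(c))
        U = new_U
        if change < delta:
            break
    return U, centers


# Hypothetical toy data: three tight groups plus one intermediate point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10),
          (5.5, 4)]
U, centers = fcm(points, c=3, m=2.7)
for point, row in zip(points, U):
    print(point, [round(u, 2) for u in row])
```

With m well above 1 every object that does not coincide with a center receives
some membership in every cluster, which is the side effect discussed in the
text.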



Table 2: Result of FCM algorithm (c = 3, m = 2.7).

It is necessary to point out that the result of the FCM method is not
completely satisfactory either. In order to improve the classification of the
fuzzy objects 6 and 13, the FCM algorithm introduces fuzziness for all other
objects, with the exception of object 18, which coincides with the center of
its cluster. None of the other objects, which nevertheless clearly belong to a
single cluster, obtains the maximum grade of membership. To obtain a good
classification of these objects it is necessary to take m near 1, but in this
case objects 6 and 13 tend to belong only to cluster 1. This disadvantage is
not present in the SFCM method which, using the same value m = 2.7 and a = 3,
gives the result shown in Table 3.

This classification seems to be the best representation of the real
classification of the objects, as it does not necessarily attribute fuzziness
to all of them, but only to objects 6 and 13, i.e. to those objects whose
positions are really uncertain in relation to the various clusters.

It is important to point out that both the FCM and SFCM algorithms are 
apparently independent of the choice of the initial matrix U. In both 
applications, for a fixed value of m and a, the initial matrix U has been changed 
many times achieving the same result. 

Lastly, to test the performance of the SFCM algorithm on a real case, the
method has been used to grade Italy's communes (municipalities) according to
their degree of urban/rural characteristics. The resulting classification,
which has not been included here for reasons of space, appears easier to read
and more immediately comprehensible than the analogous classification obtained
by the FCM method. Moreover, it corresponds more closely to reality than the
official classification of ISTAT (ISTAT, 1986), obtained using the traditional
c-means method.

Many initial matrices U were randomly generated to test the stability of the
solution obtained. The result was always the same final classification.

Abstract: In this paper we consider the possibility of applying the theories of
fuzzy sets and algebraic hyperstructures to the feasibility evaluation of urban
re-qualification projects. We study a mathematical model that helps either to
verify the validity of some choices during each design stage, or to single out
the best project among a series of possible alternatives, by estimating the
measure in which the projects attain the economic, psychological, cultural and
technological objectives.

Keywords: Fuzzy Methods. Hyperstructures. Evaluation Urban Projects.



1. Evaluation need for urban projects 

A prevalent kind of urban planning consists in an “integrated intervention
strategy” aimed at reviving the city as a whole, by promoting economic
activities through the recovery and re-qualification of the existing resources
in the urban territory. Preservation no longer involves the single monument,
but the whole building tissue; this requires achieving various and
heterogeneous objectives. So we deem it necessary to adopt sound procedures
for the evaluation of projects, in order to obtain information about the
“feasibility” of the foreseen works.

The last National Congress of Architects (Florence, March 1997) showed a 
great interest for a planning directed to quality of urban space and environment. 
This quality has to be pursued especially through a correct procedure of 
evaluation and comparison between some projects. 

Suitable instruments of feasibility evaluation are indispensable, especially at
the stages of preliminary planning or programming, when it is necessary to
establish the required amount of resources. These project evaluation
procedures have to aim at:

1. justifying the destination of the available financial resources to
preservation in comparison with other investments;

2. making a selection of projects by estimating the measure in which a program
of “integrated maintenance” improves the level of welfare;

3. raising this measure by modifying and integrating the tested project.

At present, fuzzy set theory is considered useful to minimize the uncertainty
about the influence of a decision on the validity of projects.



2. Features of the model 

The question of urban re-qualification has to be framed in a general view in
which there are not only economic needs. Other important demands are:
psychological needs, the defense of historical-artistic values and of the
aesthetic and visual qualities of areas, the improvement of the standard of
urbanization, and the removal of the social degradation of areas. It is
necessary to attain a “global” or “integrated” evaluation of projects, that is,
an evaluation that includes the various essential demands simultaneously.

Therefore we have deemed it necessary to use “multicriteria” techniques to
render the evaluation “all-inclusive”. The task is to express the measure in
which the various objectives are attained by the features of the project.
Our model may be represented by a matrix that, in our case, allows us to
compare the various impacts (feasibility categories) with a series of
objectives (criteria). These reflect the needs or the requirements (the aims)
of the community with regard to each category.

We deem it necessary to consider the following feasibility categories for an
urban project: environmental (evaluation of the effects of the project on the
natural and built environment), aesthetic-cultural (evaluation of the effects
of the project on the historical, artistic and archaeological interests),
economic (evaluation of the project to verify the increase of general welfare
in the considered area), financial (evaluation of the project to control the
cost-return ratio), technical (evaluation of the effects of the project on
construction constraints and regulations), social (evaluation of the project
to verify the attainment of the objectives concerning the people directly and
indirectly involved), procedural (evaluation of the project from the
contractual point of view and with respect to its carrying out).

The feasibility categories or the objectives may be suitably weighted so as to
take into account the importance of the various needs. The weight of each
“feasibility” represents its importance within a predefined typology of project
with pre-established priorities, whereas the weight of each “objective”
represents the degree of importance of the different aims. The assignment of
weights has to be made by listening to the opinions of all the people involved:
promoters, planners, politicians, etc. The “effectiveness” of the project is
valued with regard to each objective included in each category; in other words,
we establish the measure in which the tested project attains each objective in
every single category.



3. A fuzzy set mathematical model to compare projects

Schimpeler and Grecco (1968) consider a particular multicriteria method to
evaluate urban projects. We summarize this method using our symbology.

(a) We consider a set $\Omega$, called objective set, and a function
$W: \Omega \to [0,1]$, called weight function. $\forall \omega \in \Omega$,
$W(\omega)$ is the weight of the criterion $\omega$ with regard to the model
used to estimate the benefits of the projects. The normalization condition
$\sum_{\omega \in \Omega} W(\omega) = 1$ is assumed.

(b) Let $\mathcal{P}$ be the family of the examined projects. For any
$P \in \mathcal{P}$, we consider a function $E_P: \Omega \to [0,1]$, called
efficaciousness function of $P$. $\forall \omega \in \Omega$, $E_P(\omega)$ is
the grade in which $P$ satisfies the objective $\omega$. The function
$T_P = W E_P$ is called characteristic function of $P$.

(c) The function
$U: P \in \mathcal{P} \to \sum_{\omega \in \Omega} T_P(\omega) \in \mathbb{R}$,
called evaluation or utility function, measures the global utility of the
projects of $\mathcal{P}$.
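
As an illustration of steps (a)-(c), the sketch below computes the utility
U(P) as the sum over the objectives of W(ω) E_P(ω). The objective names, the
weights and the efficaciousness grades are invented for the example, and the
function name `utility` is an assumption, not part of the original method.

```python
def utility(weights, efficacy, tol=1e-9):
    """Global utility U(P) = sum over objectives w of W(w) * E_P(w)."""
    if abs(sum(weights.values()) - 1.0) > tol:
        raise ValueError("the weights must satisfy the normalization condition")
    if any(not 0.0 <= efficacy[obj] <= 1.0 for obj in weights):
        raise ValueError("efficaciousness grades must lie in [0, 1]")
    return sum(weights[obj] * efficacy[obj] for obj in weights)


# Hypothetical objective set with normalized weights.
weights = {"economic": 0.5, "social": 0.3, "environmental": 0.2}

# Hypothetical efficaciousness functions of two projects.
project_a = {"economic": 0.8, "social": 0.4, "environmental": 0.6}
project_b = {"economic": 0.5, "social": 0.9, "environmental": 0.3}

u_a = utility(weights, project_a)
u_b = utility(weights, project_b)
print("U(A) =", round(u_a, 3), " U(B) =", round(u_b, 3))
```

Since the weights sum to 1 and every grade lies in [0, 1], the utility of any
project lies in [0, 1] as well, which makes projects directly comparable.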

In this paper we consider some modifications to the previous model. As set
$\Omega$ of objectives we assume the Cartesian product $F \times C$, where $F$,
called set of feasibilities, is the set of the feasibility categories for an
urban project and $C$, called set of criteria, is the set of the different
points of view from which to measure the grade in which each project attains
any feasibility. The weight function is a matrix independent of the projects
considered and, for any $P \in \mathcal{P}$, the efficaciousness function is
also a matrix, dependent on $P$. Besides:

• we consider the function
$\varphi: x \in F \to \sum_{c \in C} W(x,c) \in [0,1]$, called feasibility
weight function, and the function
$\gamma: c \in C \to \sum_{x \in F} W(x,c) \in [0,1]$, called criterion weight
function. We have the conditions $\sum_{x \in F} \varphi(x) = 1$ and
$\sum_{c \in C} \gamma(c) = 1$;

• we fix a threshold function $S: F \times C \to [0,1]$ such that
$\forall (x,c) \in F \times C$, $S(x,c) \le W(x,c)$.

Therefore we explain the meanings of all concepts of our model in the fuzzy set 
theory. Besides we show that many interesting features of the model can be 
studied with the theory of hyperstructures. For these aims, we give some 
definitions. 

Definition 3.1. Let $\Omega$ be a non-empty set. A fuzzy set on $\Omega$ is a
function $f: \Omega \to [0,1]$. For any $x \in \Omega$, $f(x)$ is the
membership grade of $x$ to $f$.

In the sequel we denote by $C$ a finite non-empty family of fuzzy sets on a set
$F$. For any $x \in F$, the number $gr_C(x) = \sum_{c \in C} c(x)$ is called
membership grade of $x$ to $C$.

Definition 3.2. Let $C$ and $C^*$ be two families of fuzzy sets on $F$ and let
$\psi: C \to C^*$ be a bijection. Put $c^* = \psi(c)$, $\forall c \in C$. The
fuzzy set $T_{c^*}: x \in F \to c(x) c^*(x)$ and the family of fuzzy sets
$T_{C^*} = \{T_{c^*}\}_{c \in C}$ are called, respectively, transmitted
component of $c$ and of $C$. The pair $(C^*, \psi)$ is called filter of $C$.

By the previous definitions we have the following interpretations:

• the weight function $W$ is a family of fuzzy sets on $F$;

• the pair $(E_P, \psi)$, with $E_P$ the efficaciousness function of $P$ and
$\psi$ the function that associates to any column of $W$ the column of $E_P$
with the same index, is a filter of $W$, and the transmitted component of $W$
is the characteristic function $T_P = W E_P$.



4. An example about the application of the model 

The described model has been used to evaluate the social, economic,
environmental and aesthetic-cultural feasibility of a project for widening
main road n. 17, in the stretch connecting Poggio Picenze to Navelli, two towns
in the province of L’Aquila.

The aim of the study was to define the measure in which an investment in that
transport infrastructure was really feasible and advantageous from every point
of view, given a system of constraints and objectives to achieve.

This analysis gains importance because it is necessary to know the effects of 
accessibility on territorial growth about the optimization of housing, 
employment, easy access to public services. 

For a project in the road system sector, the proposable feasibility categories,
or objectives, are:

O1: benefit for road users with regard to the investment;

O2: positive influence on urban environment;

O3: positive influence on natural environment;

O4: territorial innovation;

O5: correspondence to territorial problems.

Given the objectives, the specific criteria are:

C1: benefit for the usual users, in terms of improvement in the quality of
transport, that is a higher speed on the roads, less traffic congestion,
reduction in travelling time, reduction in risk, reduction in discomfort;

C2: benefit for users induced to move by the reduction of transport charges in
consequence of the improvement of the road;

C3: influence on urban landscape;

C4: increase of accessibility to the central zones;

C5: increase of enjoyment from the central zones;

C6: fluctuation in pollution;

C7: extent of vegetation prejudiced by the realization of the project;

C8: extent of damage to landscape in terms of gain or loss of visual amenity;

C9: break produced in natural environment endowed with unity;

C10: new residential and productive units located in consequence of the urban
program raised by the project on transit system;

C11: multiple and balanced characteristic of the new planned settlements, on
the basis of the multiplicity and organicity of the settled functions;

C12: property of the road network after carrying out the project (property of
the network to foster or to oppose a polycentric layout, with satisfactorily
scattered settlement on territory);

C13: lack of transport infrastructure;

C14: unsatisfactory economic development;

C15: compatibility of the project with the territorial planning instruments.



The related case of study is an exemplification of the mathematical model; 
indeed we assume only two alternatives: the situation without intervention, 
called “null hypothesis”, and the proposed project. 

Through a close examination of this case, by consulting texts about evaluation
methods for public investments in transport, by listening to technical
opinions, by making inquiries about traffic flows and the variation of
travelling time in the various examined sections of road, by estimating the
fluctuation of transport charges in the situations “with” and “without”
intervention, and by suitably processing the other available data, we have
obtained the following weight matrix $W$ and the following efficaciousness
matrices $E_P$ and $E_O$, respectively for the proposed project and for the
null hypothesis; so we have

Then if $T_P$ and $T_O$ are, respectively, the characteristic matrices of the
two alternatives, we have

The component $U_{i\cdot}$ of the marginal column of $T_P$ or $T_O$ is the
partial utility of the feasibility $O_i$. Likewise, the component
$U_{\cdot j}$, $1 \le j \le 15$, of the marginal row is the partial utility of
the criterion $C_j$. The sum of the $U_{i\cdot}$, equal to the sum of the
$U_{\cdot j}$, is the global utility of the project.
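
The marginal column and row described above can be sketched numerically. The
matrices below are invented purely for illustration (2 feasibilities by
3 criteria, instead of the 5 by 15 of the road example, whose actual values are
not reproduced here), and the helper names are hypothetical.

```python
def characteristic_matrix(W, E):
    """Element-wise product T = W E of the weight and efficaciousness matrices."""
    return [[w * e for w, e in zip(w_row, e_row)] for w_row, e_row in zip(W, E)]


def marginals(T):
    """Partial utilities: row sums (feasibilities) and column sums (criteria)."""
    by_feasibility = [sum(row) for row in T]
    by_criterion = [sum(col) for col in zip(*T)]
    return by_feasibility, by_criterion


# Hypothetical weight matrix (entries sum to 1) and efficaciousness matrices.
W = [[0.2, 0.1, 0.1],
     [0.3, 0.2, 0.1]]
E_P = [[0.5, 1.0, 0.0],   # proposed project
       [1.0, 0.5, 0.5]]
E_O = [[0.2, 0.5, 0.0],   # null hypothesis
       [0.4, 0.2, 0.5]]

rows_P, cols_P = marginals(characteristic_matrix(W, E_P))
rows_O, cols_O = marginals(characteristic_matrix(W, E_O))
global_P, global_O = sum(rows_P), sum(rows_O)
print("project:", [round(v, 3) for v in rows_P], round(global_P, 3))
print("null   :", [round(v, 3) for v in rows_O], round(global_O, 3))
```

Summing the marginal column or the marginal row gives the same global utility,
so either marginal can be used to see where a project gains or loses.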

We can observe that, in this example, the proposed project is very preferable to 
the null hypothesis, even if in different measure for each feasibility or criterion. 



5. On some hyperstructures associated to the model 

We assume the definitions and the terminology given in (Corsini, 1995),
(Vougiouklis, 1994), (Migliorato, 1994), (Maturo, 1997).

Definition 5.1 A hypergroupoid $(S, \sigma)$ is a non-empty set $S$ with a
function $\sigma: S \times S \to P^*(S) = P(S) - \{\emptyset\}$, called
hyperoperation. The image of the pair $(x, y)$ is denoted by $x \sigma y$ and
is called hyperproduct of $x$ and $y$.

For any pair $(H, K)$ of subsets of $S$ different from $\emptyset$, we denote
by $H \sigma K$ the union of all the sets $x \sigma y$ with $x \in H$ and
$y \in K$. The hyperproducts $a \sigma K$ and $H \sigma a$, $a \in S$, are
considered equal, respectively, to $\{a\} \sigma K$ and $H \sigma \{a\}$.

$\forall n \in N$ and $\forall (x_1, x_2, ..., x_n) \in S^n$, the set
$\mathfrak{I}_\sigma(x_1, x_2, ..., x_n)$ of all the hyperproducts generated by
$(x_1, x_2, ..., x_n)$ is given, by induction, as follows:
$\mathfrak{I}_\sigma(x_1) = \{x_1\}$ and, for $n > 1$,
$\mathfrak{I}_\sigma(x_1, x_2, ..., x_n)$ is the set of all the hyperproducts
$K = F \sigma G$, with $F \in \mathfrak{I}_\sigma(x_1, x_2, ..., x_h)$,
$G \in \mathfrak{I}_\sigma(x_{h+1}, ..., x_n)$, $h \in \{1, 2, ..., n-1\}$. Any
element of $\mathfrak{I}_\sigma(x_1, x_2, ..., x_n)$ is called block of $S$
generated by $(x_1, x_2, ..., x_n)$.

Definition 5.2 A hypergroupoid $(S, \sigma)$ is said to be
(H1) semihypergroup if, $\forall x, y, z \in S$,
$x \sigma (y \sigma z) = (x \sigma y) \sigma z$;

(H2) quasihypergroup if, $\forall x \in S$, $x \sigma S = S \sigma x = S$;

(H3) commutative if, $\forall x, y \in S$, $x \sigma y = y \sigma x$;

(H4) hypergroup if it is a semihypergroup and a quasihypergroup.

Definition 5.3 Let $(S, \sigma)$ be a hypergroupoid. The pair $(S, B)$, with
$B$ the set of all the blocks of $S$, is called geometric space associated to
$(S, \sigma)$. A polygonal of length 1 of $B$ is a block, and a polygonal of
length $m > 1$ is an $m$-tuple $(K_1, K_2, ..., K_m)$ of blocks such that
$K_i \cap K_{i+1} \ne \emptyset$, $\forall i \in \{1, 2, ..., m-1\}$.

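Definition 5.4 A hypergroupoid $(S, \sigma)$ is said to be
(W1) weak commutative if, $\forall x, y \in S$,
$x \sigma y \cap y \sigma x \ne \emptyset$;

(W2) weak associative if, $\forall x, y, z \in S$,
$x \sigma (y \sigma z) \cap (x \sigma y) \sigma z \ne \emptyset$;

(W3) feebly associative if, $\forall n \in N$ and
$\forall (x_1, x_2, ..., x_n) \in S^n$, the intersection of all the blocks
belonging to $\mathfrak{I}_\sigma(x_1, x_2, ..., x_n)$ is different from
$\emptyset$.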
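
On a finite set the conditions (H1)-(H4), (W1) and (W2) can be checked by brute
force. The sketch below is an illustration only: the hyperoperation is stored
as a dictionary from pairs to non-empty sets, and the example operation (the
integer interval between x and y on the set {0, 1, 2}) is an assumption chosen
for the demonstration, not an operation taken from the model.

```python
from itertools import product


def hyper(op, H, K):
    """H sigma K: union of all x sigma y with x in H and y in K."""
    result = set()
    for x, y in product(H, K):
        result |= op[(x, y)]
    return result


def is_semihypergroup(S, op):  # (H1)
    return all(hyper(op, {x}, op[(y, z)]) == hyper(op, op[(x, y)], {z})
               for x, y, z in product(S, repeat=3))


def is_quasihypergroup(S, op):  # (H2)
    return all(hyper(op, {x}, S) == S == hyper(op, S, {x}) for x in S)


def is_commutative(S, op):  # (H3)
    return all(op[(x, y)] == op[(y, x)] for x, y in product(S, repeat=2))


def is_weak_commutative(S, op):  # (W1): the two hyperproducts intersect
    return all(op[(x, y)] & op[(y, x)] for x, y in product(S, repeat=2))


def is_weak_associative(S, op):  # (W2): the two bracketings intersect
    return all(hyper(op, {x}, op[(y, z)]) & hyper(op, op[(x, y)], {z})
               for x, y, z in product(S, repeat=3))


# Hypothetical example: x sigma y is the integer interval between x and y.
S = {0, 1, 2}
op = {(x, y): set(range(min(x, y), max(x, y) + 1))
      for x, y in product(S, repeat=2)}

is_hypergroup = is_semihypergroup(S, op) and is_quasihypergroup(S, op)  # (H4)
print(is_hypergroup, is_commutative(S, op),
      is_weak_commutative(S, op), is_weak_associative(S, op))
```

Every commutative hypergroupoid is weak commutative and every semihypergroup is
weak associative, so the weak checks only become informative when the strict
ones fail.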

Definition 5.5 Let $(S, \sigma)$ be a hypergroupoid and let
$\emptyset \ne T \subseteq S$. $(T, \sigma)$ is called a subhypergroupoid of
$S$ if, $\forall x, y \in T$, $x \sigma y \subseteq T$. A subhypergroupoid $T$
of $S$ is called subhypergroup if $(T, \sigma)$ is a hypergroup.

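Let $\emptyset \ne F^* \subseteq F$, $\emptyset \ne C^* \subseteq C$,
$K = F^* \times C^*$. For any projects $P, Q$ of the family $\mathcal{P}$, we
put

$P \preceq_K Q \Leftrightarrow \forall (x,c) \in K$, $T_P(x,c) \le T_Q(x,c)$;
$P \approx_K Q \Leftrightarrow P \preceq_K Q$ and $Q \preceq_K P$.

The relation $\preceq_K$ is a preorder and $\approx_K$ is an equivalence
relation.

In the sequel we assume that there exist three ideal projects $P_W$, $P_S$ and
$P_O$ such that $T_W$, $T_S$ and $T_O$ are, respectively, the weight function
$W$, the threshold function $S$ and the null function $O$. So, for any
$P, Q \in \mathcal{P}$, there are some pairs $(A, B)$ of projects such that
$A \preceq_K P$, $A \preceq_K Q$, $P \preceq_K B$, $Q \preceq_K B$.

Then we can define, for any $P, Q \in \mathcal{P}$, the following
hyperproducts:

$P \delta_K Q = \{A \in \mathcal{P}: A \preceq_K P, A \preceq_K Q$, and
($A \preceq_K B \preceq_K P$, $A \preceq_K B \preceq_K Q$) if and only if
$A \approx_K B\}$;

$P \sigma_K Q = \{A \in \mathcal{P}: P \preceq_K A, Q \preceq_K A$, and
($P \preceq_K B \preceq_K A$, $Q \preceq_K B \preceq_K A$) if and only if
$A \approx_K B\}$.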
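
For a finite family of projects the two hyperproducts can be computed directly:
the first collects the maximal common lower bounds of P and Q under the
preorder on K, the second the minimal common upper bounds. The sketch below
assumes that each project is represented by the tuple of its characteristic
values on the cells of K; the project names and numbers are hypothetical.

```python
def precedes(t_a, t_b):
    """Preorder on K: every characteristic value of a is <= that of b."""
    return all(a <= b for a, b in zip(t_a, t_b))


def delta_K(T, p, q):
    """Maximal projects among those preceding both p and q."""
    lower = [a for a in T if precedes(T[a], T[p]) and precedes(T[a], T[q])]
    return {a for a in lower
            if all(precedes(T[b], T[a]) for b in lower if precedes(T[a], T[b]))}


def sigma_K(T, p, q):
    """Minimal projects among those preceded by both p and q."""
    upper = [a for a in T if precedes(T[p], T[a]) and precedes(T[q], T[a])]
    return {a for a in upper
            if all(precedes(T[a], T[b]) for b in upper if precedes(T[b], T[a]))}


# Hypothetical characteristic values on a set K with two cells.
T = {
    "P_O": (0.0, 0.0),   # null function
    "P_S": (0.1, 0.1),   # threshold function
    "A":   (0.2, 0.3),
    "B":   (0.3, 0.2),
    "C":   (0.2, 0.2),
    "P_W": (0.5, 0.5),   # weight function
}

print(delta_K(T, "A", "B"))
print(sigma_K(T, "A", "B"))
```

The presence of the ideal projects P_O and P_W in the family guarantees that
common lower and upper bounds always exist, so neither hyperproduct is empty.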

Let $I_n = \{1, ..., n\}$. We can prove the following

Theorem 5.1 For any n-tuple $(P_1, ..., P_n)$ of projects, let
$\delta_K(P_1, ..., P_n) = \{A \in \mathcal{P}: A \preceq_K P_i, \forall i \in I_n$,
and ($A \preceq_K B \preceq_K P_i$, $\forall i \in I_n$)
$\Leftrightarrow A \approx_K B\}$ and let
$\sigma_K(P_1, ..., P_n) = \{A \in \mathcal{P}: P_i \preceq_K A, \forall i \in I_n$,
and ($P_i \preceq_K B \preceq_K A$, $\forall i \in I_n$)
$\Leftrightarrow A \approx_K B\}$. The intersection of all the blocks of
$\mathfrak{I}_{\delta_K}(P_1, ..., P_n)$ contains $\delta_K(P_1, ..., P_n)$ and
the intersection of all the blocks of $\mathfrak{I}_{\sigma_K}(P_1, ..., P_n)$
contains $\sigma_K(P_1, ..., P_n)$.

So we have the following

Theorem 5.2 The hypergroupoids $(\mathcal{P}, \delta_K)$ and
$(\mathcal{P}, \sigma_K)$ are commutative and feebly associative.

We can prove, by examples, that in general $(\mathcal{P}, \delta_K)$ and
$(\mathcal{P}, \sigma_K)$ are not associative.

Let $\mathcal{P}^S$ be the set of all the projects $P$ such that
$P_S \preceq_K P$. We have the

Theorem 5.3 $\mathcal{P}^S$ is closed under both the hyperoperations
$\delta_K$ and $\sigma_K$, and so also $(\mathcal{P}^S, \delta_K)$ and
$(\mathcal{P}^S, \sigma_K)$ are commutative and feebly associative
hypergroupoids.

The hyperoperations $\delta_K$ and $\sigma_K$ have some interesting
interpretations. $P \delta_K Q$ is the set of the best projects among all
those of lower quality than both $P$ and $Q$, relative to the set $K$. They
can be considered, for example, because of some economic necessities.
$P \sigma_K Q$ is the set of the minimal projects that have all the good
properties of both $P$ and $Q$, with respect to $K$.

Theorem 5.1 implies that the intersection of all the hyperproducts $\delta_K$
or $\sigma_K$ of a fixed number of projects is not empty. The projects
belonging to such intersections are the ones of greatest interest. We can
prove that such intersections may be different, respectively, from
$\delta_K(P_1, ..., P_n)$ and $\sigma_K(P_1, ..., P_n)$.



Variable Selection In Fuzzy Clustering



Maria Adele Milioli

Abstract: The aim of the present paper is to discuss methods for selecting a
subset of the initially observed variables in the context of fuzzy clustering.
The suggested procedure is based on the optimization of an objective function
which is specified differently according to the purpose of the selection.
Measures of cluster validity, a generalization of the Rand index and a distance
between dissimilarity matrices are then proposed as suitable functions to
optimize.

Keywords: variable selection, fuzzy clustering, measures of fuzziness, cluster
validity.

1. Introduction 

In cluster analysis, the importance of variable selection is well recognized. The 
problem is to select a subset of initially observed variables that would account 
for cluster pattern present in the data. Solutions to this problem are well known 
for classical cluster analysis (Fowlkes et al, 1988). The aim of the present paper 
is the generalization to the context of fuzzy clustering. In particular, two cases 
will be considered: a) the purpose of the selection is to identify and eliminate 
the variables that may only constitute a “noise” that masks clear cluster structure 
delineated by other variables; b) the need of selecting an appropriate subset is 
strictly a problem of reduction of dimensionality: the purpose is to eliminate 
redundant variables, i.e. to find a reduced subset of variables which reproduce 
the cluster structure generated by the complete set of the original ones “as well 
as possible”. 

2. Fuzzy clustering 

Fuzzy partitions can be defined as follows (Bezdek 1981): 

Definition 1 - A G-fuzzy partition of a set of n units is an (n × G) matrix
(1 < G < n)

$U = \{u_{ig}\}$    (1)

where $u_{ig}$ is the value of the membership function of the i-th unit to the
g-th cluster (i = 1, 2, ..., n; g = 1, 2, ..., G), with the following
constraints:

$0 \le u_{ig} \le 1$ and $\sum_{g} u_{ig} = 1$.

If $u_{ig}$ takes only the values 0 or 1, expression (1) represents a crisp
partition of the n units.

Let $X = \{x_{is}\}$ (i = 1, 2, ..., n; s = 1, 2, ..., p) denote the data
matrix containing the values taken by p variables measured on each of n units.
The variables of X could be either numeric or fuzzy.

Starting from the data matrix X, a distance matrix $D = \{d_{ij}\}$
(i, j = 1, 2, ..., n) can be derived following different approaches (Bandemer
and Nather, 1992; Leung, 1988; Zani, 1988; Zimmermann, 1985). As in classical
cluster analysis, several fuzzy clustering algorithms can be applied to matrix
D in order to obtain fuzzy partitions of the n units.
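
The constraints of Definition 1 are easy to check mechanically. The short
sketch below is an illustration only: the function names and the numerical
tolerance are assumptions, and the two matrices are invented examples.

```python
def is_fuzzy_partition(U, tol=1e-9):
    """True if every membership lies in [0, 1] and every row sums to 1."""
    return all(
        all(0.0 <= u <= 1.0 for u in row) and abs(sum(row) - 1.0) <= tol
        for row in U
    )


def is_crisp_partition(U, tol=1e-9):
    """True if U is a fuzzy partition whose memberships are only 0 or 1."""
    return is_fuzzy_partition(U, tol) and all(
        u in (0.0, 1.0) for row in U for u in row
    )


# Hypothetical 3-unit, 2-cluster examples.
fuzzy_U = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
crisp_U = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(is_fuzzy_partition(fuzzy_U), is_crisp_partition(fuzzy_U))
print(is_fuzzy_partition(crisp_U), is_crisp_partition(crisp_U))
```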



3. Variable selection 

Let k < p be the number of variables to be selected (k may be fixed or
determined iteratively, letting k = p - 1, k = p - 2, ...). The data matrix X
can be written as a partitioned matrix:

$X = [X_1, X_2]$    (2)

where $X_1$ is the matrix of the selected variables and $X_2$ collects the
remaining (p - k) eliminated variables.

The problem of variable selection can be formalized as follows (Milioli, 1993):

$\phi(X : X_1, X_2) = \max \quad (X_1, X_2 \in \Omega)$    (3)

where $\phi$ is a general objective function and $\Omega$ is the set of all
partitions of the variables of matrix X defined in (2).

Taking into account the distinction made in section 1, we consider several objec- 
tive functions which specify expression (3). 

If the aim of variable selection is the elimination of “noise” variables, the
function $\phi$ can be represented by a measure of cluster validity (Bezdek,
1981; Libert and Roubens, 1983). Measures of fuzzy clustering validity are
linked to the concepts of “separation” and “fuzziness”, where more separation
implies less fuzziness. Following most authors, who presume the least fuzzy
partitions to be the most valid, the optimization of the function $\phi$ should
lead to selecting those variables that produce the most separated, or the least
fuzzy, partitions. The best known measure of cluster validity is the partition
coefficient:

$F(U, G) = \frac{1}{n} \sum_{i=1}^{n} \sum_{g=1}^{G} u_{ig}^2$    (4)

which takes values in [1/G, 1]. It is 1/G only at the equimembership partition
[1/G]; it is one on all hard G-partitions of X. The normalized expression of
the partition coefficient is given by

$F^*(U, G) = \frac{F(U, G)\, G - 1}{G - 1}$    (5)

The objective function can then be specified as follows:

$F^*(U_1, G) = \max \quad (U_1 \in \Gamma)$    (6)

where $\Gamma$ is the set of all possible G-fuzzy partitions ($U_1$) generated
by the respective matrices $X_1$ defined in (2).

Remark 1 - The measures of cluster validity are affected by the number of clus- 
ters, even if a normalized index is used (Bezdek,1981 ). In this context, we assume 
the comparison is made among fuzzy partitions with the same number of clusters. 

When the input variables are numeric, the function $\phi$ can be based on a
generalization of a statistic, such as Wilks’ $\Lambda$ statistic or Pillai’s
trace statistic (Fowlkes et al, 1988), which takes into account the grade of
membership of the units to the clusters. In the decomposition of the total sums
of crossproducts (T), the within group (W) and between group (B) fuzzy sums of
crossproducts matrices are then defined as follows:

$W = \{w_{rs}\}$ where
$w_{rs} = \sum_{g=1}^{G} \sum_{i=1}^{n} (x_{is} - \bar{x}_{sg})(x_{ir} - \bar{x}_{rg})\, u_{ig}$    (7)

$B = \{b_{rs}\}$ where
$b_{rs} = \sum_{g=1}^{G} \sum_{i=1}^{n} (\bar{x}_{sg} - \bar{x}_{s})(\bar{x}_{rg} - \bar{x}_{r})\, u_{ig}$    (8)

With reference to the $\Lambda$ statistic, we can define a further objective
function:

$\Lambda(W_1, T_1) = |W_1| / |T_1| = \min \quad (T_1 \in \Theta)$    (9)

where $\Theta$ is the set of all possible total sums of crossproducts matrices
associated to the set $\Gamma$ defined in (6).

If the aim of variable selection is merely the elimination of redundant variables, 
we give the following suggestions. 

A generalization of the Rand index (Milioli, 1994) can be used as an objective
function in order to identify a subset of the original variables which allows
us to obtain a fuzzy partition of the n units ($U_1$) which reproduces “as well
as possible” the one generated by the complete set of the initial variables
(U).

In order to define the index, let us consider a set of n units $o_i \in A$, and
let P and Q be two different fuzzy partitions of A, with c and r clusters,
respectively. The proposed index is based on the following:

Definition 2 Given two fuzzy partitions P and Q of A, the pairs of units in each 
partition are treated in a similar way if the following conditions are satisfied: 

i) the pairs of units are classified together in both partitions in the same number 
of clusters; 

ii) moreover, the portion of the pairs of units classified together in at least
one cluster (or in different clusters) is the same in both partitions.

Let S be the set of all the pairs of units which are classified together in the
same number of clusters in P and in Q, and let s = card(S). The portion of
units classified together in the pair h = (i, j) of S in P and Q is defined as
follows:

$\varepsilon_h(P) = \sum_{s=1}^{c} u_{is}(P) \cdot u_{js}(P)$ and
$\varepsilon_h(Q) = \sum_{g=1}^{r} u_{ig}(Q) \cdot u_{jg}(Q)$    (10)



The quantities

$\delta_h(P) = 1 - \varepsilon_h(P)$ and $\delta_h(Q) = 1 - \varepsilon_h(Q)$    (11)

are the portions of units of the h-th pair classified in different clusters.

The portions of the h-th pair classified either in the same clusters or in
different clusters, common to both partitions, are

$E_h = \min[\varepsilon_h(P), \varepsilon_h(Q)]$ and
$\Delta_h = \min[\delta_h(P), \delta_h(Q)]$    (12)

The total portion of the s pairs of units treated in a similar way in P and in
Q is given by

$C = \sum_{h=1}^{s} (E_h + \Delta_h)$    (13)

The similarity index between the fuzzy partitions P and Q is defined as the
ratio between the portion (C) of pairs of units treated in a similar way in
both partitions and the total number of pairs of units:

$s(P, Q) = C / \binom{n}{2}$    (14)

It is zero if all the pairs of units are classified together always in a
different number of clusters; it is one if the $\binom{n}{2}$ pairs are treated
in a similar way (according to Definition 2) in both partitions.

In the context of variable selection the comparison is made between $U_1$ and
U. The maximization

$s(U_1, U) = \max \quad (U_1 \in \Gamma)$    (15)

leads to selecting the subset of k variables which identifies the fuzzy
partition most similar to the one generated by the full set of the original
variables.

To face the problem without a preceding cluster analysis, the selection can be
based on the comparison between the distance matrix relating to the p original
variables (D) and the one determined with reference to a subset of k variables
($D_1$). On this topic, Bandemer and Nather (1992, pp. 153-157) introduce the
definition of “neighbourhood” of variables, which can suggest a reduction of
the number of the original variables by deleting or combining variables
allocated in the same narrow neighbourhood.

4. A stepwise algorithm for variable selection

The maximization of expression (3) could involve a great amount of calculation,
especially if the number of variables to select is unknown, as often happens in
exploratory analysis. To overcome this problem we suggest a stepwise procedure
(backward elimination) whose iterative scheme consists of the following steps:

1) calculate the values of $\phi(X : X_{[s]}, X_s)$ (s = 1, 2, ..., p), where
$X_{[s]}$ is the data matrix without variable $X_s$;

2) delete the variable, say $X_t$, for which $\phi$ is a maximum or a minimum,
according to the selected objective function;

3) recalculate the values of $\phi(X_{[t]} : X_{[t,s]}, X_s)$, where $X_s$ is
one of the remaining variables;

4) iterate steps 2 and 3 until a stopping condition is satisfied, e.g. when the
desired number of variables has been reached or when the variation of the
function in two successive iterations is less than a fixed threshold.
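
The four steps can be sketched as a generic backward-elimination loop. This is
a minimal illustration, not the author's program: the function name
`backward_elimination`, the callback `objective` (which stands for the function
evaluated on the data restricted to a subset of variables) and the toy
"noise" scores are assumptions, and only the stopping rule based on the desired
number of variables is implemented.

```python
def backward_elimination(variables, objective, k, maximize=True):
    """Delete one variable per step until k remain.

    objective(subset) is the value of the objective function when only the
    variables in `subset` are kept. Returns (remaining, history), where
    history lists (deleted variable, objective value after deletion).
    """
    remaining = list(variables)
    history = []
    pick = max if maximize else min
    while len(remaining) > k:
        # Steps 1 and 3: evaluate the objective without each candidate variable.
        scored = [(objective([w for w in remaining if w != v]), v)
                  for v in remaining]
        # Step 2: delete the variable whose removal is best for the objective.
        value, deleted = pick(scored, key=lambda pair: pair[0])
        remaining.remove(deleted)
        history.append((deleted, value))
    return remaining, history


# Hypothetical objective: minus the total "noise" carried by the kept variables.
noise = {"X1": 1, "X2": 9, "X3": 5, "X4": 0}


def toy_objective(subset):
    return -sum(noise[v] for v in subset)


remaining, history = backward_elimination(list(noise), toy_objective, k=2)
print(remaining)
print(history)
```

In a real application `objective` would rerun the fuzzy clustering on the
reduced data matrix and return, for instance, the normalized partition
coefficient or the similarity index with the full-data partition.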

5. Application 

As an example, the problem of variable selection has been analyzed with
reference to a set of 10 economic indicators related to the Italian regions in
1993 (source: ISTAT):

X1 = employment rate, in %
X2 = female employment rate, in %
X3 = incidence of employees in agriculture on the total employees, in %
X4 = unemployment rate, in %
X5 = female unemployment rate, in %
X6 = gross domestic product per inhabitant, in thousands of lire
X7 = final household consumption per inhabitant, in thousands of lire
X8 = incidence of foodstuffs consumption on the total consumption, in %
X9 = private expenditure for public performances per inhabitant, in thousands
of lire
X10 = deposits per inhabitant, in millions of lire

The first five indicators specifically reflect the labour force of the Italian
regions, with attention to women’s participation in socio-economic activities,
while the last five indicators are related to income and wealth. A fuzzy
partition of the Italian regions according to the previous indicators has been
obtained using the fuzzy G-means algorithm FANNY proposed by Kaufman and
Rousseeuw (1990), which has suggested a solution in two fuzzy clusters (see
Table 1) presenting a value of the partition coefficient equal to 0.41.

Table 1 shows that the first 12 regions, which belong to northern and central
Italy, characterize cluster 1, because they have high membership to this group
and very low membership to the other one. An inverse situation is shown for the
regions of southern Italy, which characterize cluster 2. The region of Abruzzo,
which has almost equimembership to the groups, presents features similar to the
regions of both clusters.

Table 1: Fuzzy partition of the Italian regions based on 10 indicators.

REGIONS           u_i1     u_i2     REGIONS       u_i1     u_i2
Piemonte          0.888    0.112    Marche        0.825    0.175
Valle D’Aosta     0.778    0.222    Lazio         0.797    0.203
Lombardia         0.863    0.137    Abruzzo       0.518    0.482
Trentino A.A.     0.757    0.243    Molise        0.275    0.725
Veneto            0.887    0.113    Campania      0.145    0.855
Friuli V.G.       0.848    0.152    Puglia        0.137    0.863
Liguria           0.743    0.257    Basilicata    0.145    0.855
Emilia Romagna    0.805    0.195    Calabria      0.132    0.868
Toscana           0.868    0.132    Sicilia       0.138    0.862
Umbria            0.722    0.278    Sardegna      0.151    0.849
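
The value 0.41 quoted for this partition can be checked from the memberships in
Table 1. The sketch below assumes that the quoted figure is the normalized
coefficient, (F G - 1)/(G - 1) with G = 2, since the unnormalized coefficient
cannot fall below 1/G = 0.5; the function names are illustrative.

```python
def partition_coefficient(U):
    """F(U, G): mean over the units of the sum of squared memberships."""
    return sum(u * u for row in U for u in row) / len(U)


def normalized_partition_coefficient(U):
    """F*(U, G) = (F(U, G) * G - 1) / (G - 1)."""
    G = len(U[0])
    return (partition_coefficient(U) * G - 1.0) / (G - 1.0)


# Memberships to cluster 1 from Table 1, in the order
# Piemonte ... Umbria, then Marche ... Sardegna.
u_i1 = [0.888, 0.778, 0.863, 0.757, 0.887, 0.848, 0.743, 0.805, 0.868, 0.722,
        0.825, 0.797, 0.518, 0.275, 0.145, 0.137, 0.145, 0.132, 0.138, 0.151]
U = [(u, round(1.0 - u, 3)) for u in u_i1]

f = partition_coefficient(U)
f_star = normalized_partition_coefficient(U)
print(round(f, 3), round(f_star, 3))
```

The unnormalized coefficient comes out near 0.705, and normalizing it gives a
value that rounds to the 0.41 reported in the text.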



The stepwise algorithm for variable selection illustrated in section 4, choosing 
firstly the partition coefficient as an objective function to maximize, has been 
applied in order to reduce the degree of fuzziness of the initial partition, deleting 
indicators which may constitute a “noise”. 

Table 2 indicates, for each step of the procedure, the variable whose deletion
yields the highest value of the partition coefficient, together with that value.
For example, at the first step, the deletion of variable X2 generates the
highest increment of the partition coefficient (which becomes 0.43). In the
second step, after the elimination of X2, the deletion of variable X1 leads the
partition coefficient to a maximum value equal to 0.45. If we decide to stop the
procedure at step 6, we can see that indicators X4, X5, X6 and X7 generate a
2-fuzzy partition of the Italian regions where the two clusters are clearly more
separated than the two groups in the initial classification (F*(U, 2) = 0.54).
Moreover, according to the stepwise procedure, the unemployment rate (X4)
and the gross domestic product per inhabitant (X6) are the two indicators which
generate the partition of the Italian regions with the lowest degree of fuzziness
(F*(U, 2) = 0.57).
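
The values of F*(U, 2) quoted above can be put in context by evaluating the same kind of index on the memberships of table 1. The sketch below is only illustrative: it assumes that F* is the usual normalized partition coefficient, F* = (cF - 1)/(c - 1) with F the mean of the squared memberships, which is a definition not restated in this section. The function names are hypothetical.

```python
def partition_coefficient(U):
    """Partition coefficient F = (1/n) * sum_i sum_k u_ik^2."""
    n = len(U)
    return sum(u * u for row in U for u in row) / n


def normalized_partition_coefficient(U):
    """Assumed normalization: rescales F from [1/c, 1] to [0, 1]."""
    c = len(U[0])
    return (c * partition_coefficient(U) - 1.0) / (c - 1.0)


# Memberships to cluster 1 taken from table 1 (u_i2 = 1 - u_i1).
u1 = [0.888, 0.778, 0.863, 0.757, 0.887, 0.848, 0.743, 0.805, 0.868, 0.722,
      0.825, 0.797, 0.518, 0.275, 0.145, 0.137, 0.145, 0.132, 0.138, 0.151]
U = [(u, 1.0 - u) for u in u1]

f_star = normalized_partition_coefficient(U)
print(round(f_star, 2))  # about 0.41 for the partition of table 1
```

Under this assumption the initial partition scores about 0.41, slightly below the value 0.43 reported for the first deletion step.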

The generalized Rand index defined in (14) has then been used in order to
eliminate redundant indicators. The results of the procedure are summarized
in table 3. At the first step, the algorithm shows that the deletion of X3 leads to
a 2-fuzzy partition of the Italian regions almost equal to the one obtained from
the entire set of indicators (s(Ui, U) = 0.992). The incidence of employees in
agriculture is therefore redundant with respect to the other 9 indicators. If we stop
the procedure at step 6, the solution generated by X2, X4, X6 and X8 is still very
similar to the one shown in table 1 (s(Ui, U) = 0.972). It is worth observing
that even only two indicators, X2 (female employment rate) and X6 (gross
domestic product per inhabitant), can reproduce the original classification very
well (s(Ui, U) = 0.951).




Table 2: Stepwise variable selection: partition coefficient.

STEP   DELETED VARIABLE   F*(Ui, 2)max
1      X2                 0.43
2      X1                 0.45
3      X10                0.48
4      X8                 0.50
5      X3                 0.52
6      X9                 0.54
7      X7                 0.56
8      X5                 0.57



Table 3: Stepwise variable selection: generalized Rand index.

STEP   DELETED VARIABLE   s(Ui, U)max
1      X3                 0.992
2      X1                 0.989
3      X7                 0.989
4      X5                 0.989
5      X9                 0.981
6      X10                0.972
7      [illegible]        0.968
8      [illegible]        0.951



In table 4 some results of both criteria are compared. Maximizing the
partition coefficient selects subsets of indicators that markedly reduce the
fuzziness of the obtained partitions, but it disregards their similarity with
the initial solution. Conversely, maximizing the generalized Rand
index selects subsets of indicators that reproduce the cluster structure
satisfactorily, but it does not improve the validity of the solution to the same extent.
Finally, it should be noted that, for the fuzzy classification of the Italian regions,
the gross domestic product per inhabitant seems to be a crucial indicator because
it reproduces most of the information given by the other economic indicators
and, at the same time, it generates well separated groups.

Table 4: Comparison between procedures of table 2 and table 3.

subset of selected indicators   F*(U, 2)   s(Ui, U)
X4, X5, X6, X7                  0.54       0.9356
X4, X6 (from table 2)           0.57       0.9191
X2, X4, X6, X8                  0.43       0.9716
X2, X6 (from table 3)           0.43       0.9511



Abstract: Spatially distributed observations occur naturally in a number of
empirical situations; their analysis is theoretically challenging because of
the multidirectional dependence among neighbouring observations.
Such dependence often causes standard statistical methods, which are
based on independence assumptions, to fail badly. This paper concerns
the problem of discrimination and classification of spatial binary data. It
presents a suitable discrimination function based on Markovian automodels and
suggests a solution to the allocation problem through a Gibbs sampler-based
procedure.

Keywords: Binary spatial observations, spatial discrimination and
classification, Logistic-Autologistic model, Gibbs sampler.



1. Introduction 

Spatially dependent observations arise in a wide variety of applications, 
including archaeological and agricultural studies, pattern recognition and 
econometrics, epidemiological spreading and survey analysis (see e.g. Haining, 
1990). 

Traditional statistical techniques assume independence among the observations,
a hypothesis which is obviously violated in all geographical and territorial
studies, where “(...) everything is related to everything else, but near things are
more related than distant things”, as stated in Tobler’s first law of geography
(Tobler, 1970). Several well-established models have been introduced in the
statistical literature to take account of the multidirectional dependence
peculiar to spatial observations. For earlier developments see, for example, the
seminal papers by Besag (1974) and Strauss (1977); for a more recent and
comprehensive review see Cressie (1991).

In this paper, we concentrate on the definition of a discrimination function for
spatial observations, incorporating the notion of contextual information.
Generally speaking, discriminant analysis has two different
aims: discrimination and classification. The term discrimination concerns the
process of deriving classification (or allocation) rules from samples of labelled
units, while the term classification refers to the application of these rules to
identify the group membership of new units (see Krzanowski & Marriott, 1995).

The paper is organised as follows: section 2 discusses the limits of the
classical approach to logistic discriminant analysis when dealing with
spatial data. Section 3 introduces the definition of a suitable spatial
discrimination function based on the Logistic-Autologistic model. In particular, we
discuss some issues related to parameter estimation using maximum
pseudolikelihood and describe a solution, through an algorithm based on the Gibbs
sampler (Geman & Geman, 1984), to the outlined spatial allocation problem.
Section 4 presents the application of the proposed method to a real situation.
The aim is to discriminate between pixels from a remotely sensed image of the
Nebrodi Mountains (Sicily) using the information about a number of soil
characteristics. The final section presents concluding
remarks and outlines possible future research lines.



2. The logistic approach 

The earliest ideas in logistic discrimination date back to the 1960s, but the
essential features of the method as it is applied today were developed by
Anderson (1982). The method assumes that the posterior probabilities of group
membership for the i-th unit, characterised by the covariate vector
$X_i = (x_{i1}, \ldots, x_{iq})'$, are given by (see McLachlan, 1992):

$$\Pr(Y_i = 1 \mid X_i) = \frac{\exp(a + b'X_i)}{1 + \exp(a + b'X_i)} \qquad (1)$$

$$\Pr(Y_i = 0 \mid X_i) = \frac{1}{1 + \exp(a + b'X_i)} \qquad (2)$$

where a is the intercept and $b = (b_1, \ldots, b_q)'$ is a q-dimensional vector of
parameters. Expressions (1) and (2) can be easily summarised as follows:

$$\Pr(Y_i = y_i \mid X_i) = \frac{\exp[y_i(a + b'X_i)]}{1 + \exp(a + b'X_i)} \qquad (3)$$

According to the previous expressions, we can write the log-odds as:

$$\ln \frac{\Pr(Y_i = 1 \mid X_i)}{\Pr(Y_i = 0 \mid X_i)} = a + b'X_i \qquad (4)$$

The attractiveness of this approach is that the model can be applied empirically,
whatever the nature of the involved distributions and even in cases where the
distribution of X is not known (cf. Anderson, 1982).

To apply the method, we estimate the parameters a and b from the training set
(discrimination phase), substitute these estimates in (3) and allocate the i-th unit
to the group having the highest posterior probability (classification phase) or,
in terms of log-odds, to the group having $Y_i = 1$ if (4) is positive and to the
group having $Y_i = 0$ if (4) is negative.
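
As a minimal sketch of the classification phase just described, the following functions evaluate the posterior probability (3) and apply the log-odds rule (4) for given estimates of a and b. The function names and the treatment of ties (a log-odds of exactly zero is allocated to group 0) are assumptions made only for illustration.

```python
import math


def log_odds(a, b, x):
    """Linear predictor a + b'x of expression (4)."""
    return a + sum(bk * xk for bk, xk in zip(b, x))


def posterior(y, a, b, x):
    """Pr(Y_i = y | X_i = x) of expression (3), for y in {0, 1}."""
    eta = log_odds(a, b, x)
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if eta >= 0:
        p1 = 1.0 / (1.0 + math.exp(-eta))
    else:
        p1 = math.exp(eta) / (1.0 + math.exp(eta))
    return p1 if y == 1 else 1.0 - p1


def allocate(a, b, x):
    """Allocate to group 1 when the log-odds are positive, else to group 0."""
    return 1 if log_odds(a, b, x) > 0 else 0
```

For instance, with a = -1 and b = (2), a unit with x = 1 has log-odds equal to 1 and is allocated to group 1.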

The use of a logistic approach when dealing with spatial data is
statistically incorrect, because the observations are mutually dependent. In fact,
closeness between units belonging to a certain group modifies our expectation
of the group membership of other units. As a consequence, it seems reasonable to
formulate a conditional model that links the auxiliary information contained in the
covariate vector $X_i$ with the notion of spatial dependence (contextual
information).



3. The contextual approach to logistic discriminant analysis 

To introduce our model, let us start by assuming that we are analysing a study area
which can be discretized into M contiguous quadrate cells and that for N of
these cells (N<M) we are able to assess precisely the group membership (we
will refer to these N cells as the training set).

This is only a working simplification, which can be relaxed, when analysing
irregular lattices, by imposing a more general connectivity structure between the
sample units.

In the case of binary data, each cell i of the training set is black ($Y_i = 1$) if the
characteristic under study is present and is white ($Y_i = 0$) otherwise.

Our aim is to label the other (M-N) cells on the basis of the information
obtained from the training set (see Figure 1).



Figure 1: An example of square lattice grid (legend: cells with $Y_i = 1$, cells
with $Y_i = 0$, cells not labelled)



In order to describe a model for spatial discriminant analysis, we can define a
Logistic-Autologistic Model (in the following LAM, see also Arbia & Espa,
1996) as:

$$\Pr(Y_i = y_i \mid Y_j = y_j,\ j \in N(i);\ X_i) = \frac{\exp\left[ y_i \left( a + \sum_{j \in N(i)} r_{ij}\, y_j + b'X_i \right) \right]}{1 + \exp\left( a + \sum_{j \in N(i)} r_{ij}\, y_j + b'X_i \right)} \qquad (5)$$

where $Y_i$ and $Y_j$ are binary random variables, N(i) is the neighbourhood set of
site i, a and b have the same meanings as before and the quantities $r_{ij}$ represent
interaction parameters referring to the cliques of size two (Haining, 1990).

The LAM is also known in the literature as the autologistic model for large scale
variation and as the non-stationary autologistic model (see Cressie, 1991). In the
present framework, however, we prefer the LAM notation to show the conjoint nature of
the model, obtained by combining auxiliary and contextual information.

If we consider a LAM based on the first-order neighbourhood system
(see Figure 2), the spatial interaction parameters $r_{ij}$ can be decomposed as:

$$r_{ij} = r\, w_{ij} \quad \text{with} \quad w_{ij} = \begin{cases} 1 & \text{if } j \in N_1(i) \\ 0 & \text{if } j \notin N_1(i) \end{cases} \qquad (6)$$

where r is a constant.

Figure 2: First-order neighbourhood system, $N_1(i) = \{j, l, m, n\}$



So, the LAM can be written as:

$$\Pr(Y_i = y_i \mid Y_j = y_j,\ j \in N_1(i);\ X_i) = \frac{\exp\left[ y_i \left( a + r \sum_{j \in N_1(i)} y_j + b'X_i \right) \right]}{1 + \exp\left( a + r \sum_{j \in N_1(i)} y_j + b'X_i \right)} \qquad (7)$$



The previous model is used by physicists in ferromagnetism studies and is
known as Ising’s law (Ising, 1925).

With an irregular lattice, $w_{ij}$ can be specified in a number of different forms; for
example, as (Cressie, 1991):

$$w_{ij} = \begin{cases} 1/d_{ij} & \text{if } j \in N(i) \\ 0 & \text{if } j \notin N(i) \end{cases} \qquad (8)$$

where $d_{ij}$ represents the distance between neighbouring sites i and j.
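
The inverse-distance weights of expression (8) can be sketched as follows. The text does not say how N(i) is chosen on an irregular lattice, so the code assumes, purely for illustration, that two sites are neighbours when their Euclidean distance does not exceed a given radius; the function name and the `radius` parameter are hypothetical.

```python
import math


def inverse_distance_weights(coords, radius):
    """Return the matrix W with w_ij = 1/d_ij for neighbours and 0 otherwise.

    coords: list of (x, y) site coordinates.
    radius: assumed neighbourhood rule, j is in N(i) when 0 < d_ij <= radius.
    """
    n = len(coords)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a site is not its own neighbour
            d = math.dist(coords[i], coords[j])
            if 0.0 < d <= radius:
                W[i][j] = 1.0 / d
    return W


# Three hypothetical sites: the first two are 2 units apart, the third is far away.
W = inverse_distance_weights([(0.0, 0.0), (0.0, 2.0), (5.0, 0.0)], radius=3.0)
```

Here `W[0][1]` and `W[1][0]` equal 0.5, while every weight involving the third site is zero.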

As stated in Arbia and Espa (1996), there are some statistical problems
connected with model (5) that are still unsolved. The spatial discrimination
function (5) cannot be applied directly as function (3) can; in fact, in the spatial
model the binary variable Y is present both as a dependent variable ($Y_i$, on the left
side of (5)) and as an independent one ($Y_j$, on the right side of (5)). Hence, in the
classification step, we should predict $Y_i$ using the variables $X_i$ (known for all the
analysed units) and $Y_j$, which are not known for units not belonging to the
training set. So, when allocating units not belonging to the training set, we
should use information about quantities which are not known. To solve our
problem of spatial unit allocation, we propose the following two-step
procedure:

Step 1 (Discrimination phase): We estimate the model parameters via the
maximum pseudolikelihood (MPL) procedure (Besag, 1975). Besag coined the
term pseudolikelihood to indicate the product of the conditional probabilities
(5). We obtain MPL estimates by regressing $Y_i$ on the q+1 covariates
$(x_{i1}, x_{i2}, \ldots, x_{iq}, \sum_{j \in N_1(i)} y_j)$ for the training set. All these
variables are known for
units belonging to the training set. The maximum pseudolikelihood estimator is
consistent (as shown by Geman & Graffigne, 1987) and gives good results in
practical applications (Ripley, 1990). The pseudolikelihood estimates can be
computed using standard statistical packages (see Strauss & Ikeda, 1990).

Step 2 (Classification phase): We consider the parameter estimates obtained
from the training set as representative of the whole lattice grid. Hence, we
allocate the remaining cells through a simulation process based on the
autologistic model, including the covariate effects (known for all M cells). In
this way, all cells are classified. This process is based on a simple Gibbs
sampler procedure, simulating an autologistic random field with parameters
equal to those estimated on the units of the training set.
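
Step 1 amounts to an ordinary logistic regression in which the sum of the neighbouring labels enters as an additional covariate. The sketch below illustrates this with NumPy and a plain Newton-Raphson fit; it is not the authors' implementation. It assumes that the training set is a rectangular block of cells, that cells outside the block contribute zero to the neighbour sum, and that the data are not perfectly separable (otherwise the Newton iteration does not converge). All function names are hypothetical.

```python
import numpy as np


def neighbour_sum(Y):
    """Sum of the four first-order neighbours of each cell (zero outside the block)."""
    P = np.pad(Y, 1)  # pads with zeros by default
    return P[:-2, 1:-1] + P[2:, 1:-1] + P[1:-1, :-2] + P[1:-1, 2:]


def fit_logistic(Z, y, n_iter=50, tol=1e-10):
    """Newton-Raphson fit of a logistic regression; Z must include a column of ones."""
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        grad = Z.T @ (y - p)                         # score vector
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])  # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def mpl_estimate(Y, X):
    """Maximum pseudolikelihood estimates (a, b, r) for model (7).

    Y: (rows, cols) array of 0/1 labels of the training block.
    X: (rows, cols, q) array of covariates of the same cells.
    """
    q = X.shape[2]
    Z = np.column_stack([np.ones(Y.size), X.reshape(-1, q), neighbour_sum(Y).ravel()])
    beta = fit_logistic(Z, Y.ravel().astype(float))
    return beta[0], beta[1:-1], beta[-1]
```

The last coefficient returned by `mpl_estimate` plays the role of the interaction parameter r, the first one of the intercept a.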



4. A real example 

The proposed method has been applied to a real data set, consisting of a remotely
sensed image from Landsat TM (Thematic Mapper). The image, covering a
surface of 2.016 km², is composed of 2,240 (40×56, vertical×horizontal)
quadrate cells (in the following pixels), each representing a ground surface of
30m×30m, which is determined by the satellite spatial resolution.

The aim was to discriminate between farmed surfaces ($Y_i = 1$) and different
surfaces ($Y_i = 0$) on the basis of a number of related soil characteristics; the first
group is composed of lavic arboreous, citrus fruit, sowing and other arboreous
farming (in particular olives and almonds), while the latter represents pastures,
woods, thin woods and spontaneous vegetation. The related soil characteristics
are: permeation capacity in saturation conditions (measured in millimetres per
day), horizon thickness (in centimetres), mean yearly rainfall (in centimetres)
and DTM (digital terrain model, in metres).

The dimension of the training set was fixed at 200 pixels and the analysis was
repeated adopting 4 different choices for the training set, to check the sensitivity
of the proposed approach to different specifications of the training set. To choose
the training sets, we divided the lattice into four quadrants and sampled a pixel
from each quadrant. The training sets were formed by rectangles containing
12×16 pixels (vert.×hor.) and having the chosen pixels as one of their corners.
In the four analysed training sets, the estimated values of the interaction parameter r
were in the range (0.6-0.7), thus indicating a significant spatial dependence;
furthermore, the values of -2 log(likelihood), even if it cannot be considered a
fully correct indicator of model fit, were always smaller for the LAM than
those obtained by the standard logistic approach.

The Gibbs sampler procedure generated realisations of an autologistic random
field with parameters $a_i$ and r estimated on the training set. To take into account
the local information of each site i, the parameter $a_i$ was defined as:

$$a_i = a + b'X_i \qquad (9)$$
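
A single-site Gibbs sampler for this allocation step can be sketched as follows. The details are assumptions made for illustration and are not taken from the original Matlab code: training cells are kept fixed at their observed labels, unlabelled cells are initialised with the non-spatial logistic rule, neighbours outside the lattice contribute zero, and the state after the last sweep is returned as the allocation.

```python
import math
import random


def gibbs_allocate(field, r, labels, n_sweeps=50, seed=0):
    """Simulate the autologistic field (7) with site-specific intercepts (9).

    field[i][j]:  local term a_i = a + b'X_i of cell (i, j).
    labels[i][j]: 1 or 0 for training cells, None for cells to be allocated.
    """
    rng = random.Random(seed)
    rows, cols = len(field), len(field[0])
    # Assumed starting point: non-spatial logistic rule on unlabelled cells.
    y = [[labels[i][j] if labels[i][j] is not None else int(field[i][j] > 0)
          for j in range(cols)] for i in range(rows)]
    for _ in range(n_sweeps):
        for i in range(rows):
            for j in range(cols):
                if labels[i][j] is not None:
                    continue  # training cells stay fixed
                s = sum(y[k][l]
                        for k, l in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                        if 0 <= k < rows and 0 <= l < cols)
                eta = field[i][j] + r * s
                eta = max(min(eta, 700.0), -700.0)  # keep exp() within range
                p1 = 1.0 / (1.0 + math.exp(-eta))   # Pr(Y_i = 1 | neighbours)
                y[i][j] = 1 if rng.random() < p1 else 0
    return y
```

Running the function with different seeds yields different allocation scenarios for the same parameter estimates.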

The proposed algorithm was iterated 50 times for each of the 4 training sets, thus
leading to 200 different allocation scenarios. Figure 3 summarises the obtained
results in terms of the percentage of correctly classified pixels (note that this
percentage is less than or equal to 71% in the case of the logistic discriminant function).



Figure 3: Distribution of the percentage of correctly classified pixels

Note that in more than 80% of the analysed cases the solution provided by the
LAM is significantly better than that obtained by a standard logistic
approach to discriminant analysis.



5. Concluding remarks 

In this paper we have outlined the obvious limitations that arise when applying
logistic discriminant analysis to spatially distributed observations. We have
contrasted the classical approach with a more congruent one, based on the
LAM, which takes into account the mutual and multidirectional dependence of
geographical and territorial sites, together with auxiliary information available in
the form of covariate measurements on the cells constituting the analysed
lattice. In particular, we have suggested a simple algorithm, based on the Gibbs
sampler, to provide a feasible solution to the allocation problem in the case of a
spatially located binary response variable.

The algorithm proposed for the classification phase has been implemented in
Matlab code and applied to a real data set, consisting of a portion of a remotely
sensed image of the Nebrodi Mountains (Sicily).

In this context, results obtained from LAM have been contrasted with those 
obtained via a standard logistic discriminant function, and seem to confirm that, 
when working with spatial data, there is no evident reason for leaving out the 
contextual information from the model specification. However, this case study 
should be considered only a particular one, with a major interaction among 
neighbouring units. Hence, to confirm these conclusions in the general case, 
more research is needed, in the form of a large-scale simulation study 
characterised by different values of interaction parameters, and/or by different 
sizes of the training set and of the whole lattice. 

It is worth noting that the proposed algorithm and, in general, the whole method
of estimation and classification can be straightforwardly applied to the general
situation of spatially located multinomial outcome variables (g groups).



Abstract: Symbolic data analysis aims at extending classical data analysis to
data representing classes of individuals instead of single individuals. A major
problem in symbolic data analysis is discrimination, that is the generation of data
representing classes. Such data can be expressed as classification rules, which
are learned from training examples. The paper addresses the problem of learning
classification rules from examples described by both numeric and symbolic
attributes, so that the discretization of the continuous-valued attributes is
performed during the learning process. The proposed technique has been
embedded into a classification learning system, named INDUBI/CSL, and tested
on several data sets.

Keywords: Symbolic data analysis, discrimination, discretization of continuous- 
valued attributes. 



1. Introduction 

Symbolic data analysis aims at extending classical data analysis methods and
algorithms to more complex data, well adapted to represent classes of individuals
instead of single individuals (Diday, 1990). Indeed, in classical data analysis a
single individual can be represented as a feature vector, where each feature (or
attribute or variable) takes a single value (either discrete or continuous). On the
contrary, symbolic objects can be represented as feature vectors with
multi-valued variables and interval variables, that is variables that take values in the
power set of a given universe of objects. Therefore, a symbolic object is the
description of all the properties valid for a class of individuals.

Several studies have been performed on methods for analyzing symbolic data,
such as feature selection (Ichino, 1994) and unsupervised classification learning
(or clustering) (Brito, 1994). However, one of the main problems remains the
generation of symbolic objects from examples. More precisely, given a set of
mutually exclusive classes, C1, C2, ..., Ck, and a set E of individuals of such
classes (i.e., E is a set of training examples), the problem is that of building a set
of symbolic objects Hi for each class Ci. Each symbolic object can be interpreted
as a classification rule:

if y1=Y1, y2=Y2, ..., ym=Ym then class = Ci

where yj are the multi-valued/interval variables used to describe the class Ci, and
Yj are the corresponding sets of values. A symbolic object is said to cover an
individual when the value taken by each variable yj, j=1, ..., m, in the description
of the individual is a member of Yj. For instance, the following symbolic object:

if color_hat={black,red}, height=[10..20] then class = poisonous_mushroom

covers the individual

color_hat=red, height=12, weight=6

but not

color_hat=black, height=12, weight=16

Generated symbolic objects should satisfy two properties:

1. completeness, that is all individuals in E of class Ci should be covered by
some object in Hi,

2. consistency, that is symbolic objects in Hi should not cover individuals in E of
class Cj for any j≠i.
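
The notions of covering, completeness and consistency can be made concrete with a short sketch. The representation is an assumption chosen for illustration only: the body of a symbolic object is a dictionary mapping each variable either to a set of admitted values or to a closed interval (low, high), and an individual is a dictionary of single values. The second training example is hypothetical.

```python
def satisfies(value, allowed):
    """True if value belongs to a set of values or to a closed interval (low, high)."""
    if isinstance(allowed, (set, frozenset)):
        return value in allowed
    low, high = allowed
    return low <= value <= high


def covers(body, individual):
    """A symbolic object covers an individual when every condition is satisfied."""
    return all(var in individual and satisfies(individual[var], allowed)
               for var, allowed in body.items())


def is_complete(bodies, examples, cls):
    """Every example of class cls is covered by at least one object."""
    return all(any(covers(b, x) for b in bodies) for x, c in examples if c == cls)


def is_consistent(bodies, examples, cls):
    """No object covers an example of a different class."""
    return not any(covers(b, x) for b in bodies for x, c in examples if c != cls)


poisonous = {"color_hat": {"black", "red"}, "height": (10, 20)}
examples = [
    ({"color_hat": "red", "height": 12, "weight": 6}, "poisonous_mushroom"),
    ({"color_hat": "brown", "height": 25, "weight": 9}, "edible_mushroom"),  # hypothetical
]
```

Variables that do not appear in the body, such as weight, place no restriction on the individuals covered.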

Therefore, the generation of symbolic objects from examples is a typical
discrimination problem, though traditional techniques of multivariate data
analysis, and in particular discriminant analysis, cannot be straightforwardly
applied in this case. Indeed, such techniques are not able to handle either
multi-valued or interval variables, and above all, are not able to produce classification
rules whose body is a logical conjunction of conditions that single variables have
to satisfy. Symbolic classification learning methods face this kind of problem
and investigate properties of algorithms that generate symbolic classification
rules.

In most symbolic classification learning systems, continuous-valued
attributes are discretized before starting the learning process. In the paper we
address the problem of learning classification rules from both numeric and
symbolic data, in such a way that discretization of continuous-valued attributes
is performed during the learning process by a specialization operator. This
operator has been embedded into a classification learning system, named
INDUBI/CSL (Malerba et al., 1997a), and tested on several data sets.



2. The learning strategy 

The learning strategy adopted by INDUBI/CSL to generate classification rules is
called separate-and-conquer (see Figure 1). The separate stage of the algorithm
is a cyclic process that searches for the current hypothesis Hi, expressed as a set
of rules, by checking for its completeness. If the hypothesis is complete, the
learning process for the class Ci is over, otherwise those examples in E covered
by Hi are marked and the search for a new consistent rule to add to Hi is started.
In contrast, the conquer stage performs a general-to-specific search to construct
a new consistent rule. In each conquer phase a seed example e* of class Ci is
chosen from unmarked examples in E. The seed guides the learning process.
Each rule generated in the conquer phase should classify correctly at least e*
and, possibly, other training examples of class Ci. Actually, the seed example
plays the role of a prototype: thus it could be appropriately chosen as the most
representative case of Ci in E.

procedure separate_and_conquer(Examples, Ci)
    E+ := set of positive Examples of Ci
    E- := set of negative Examples of Ci
    Hi := ∅
    while E+ ≠ ∅ do
        ConsistentRules := ∅
        randomly select the seed e* from E+
        ConsistentRules := Beam_Search_for_consistent_rules(e*, E+, E-, m)
        Best := FindBest(ConsistentRules)
        Hi := Hi ∪ {Best}
        E+ := E+ - Covers(Best, E+)
    endwhile
    return Hi

Figure 1: High level separate-and-conquer search strategy.
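
The separate-and-conquer loop can be illustrated with a deliberately simplified Python sketch. It is not INDUBI/CSL: it handles only nominal attributes, replaces the beam search with a greedy search of width 1, builds conditions of the form attribute = seed value, and picks the first uncovered positive example as seed instead of a random one, so that the result is reproducible. All names are hypothetical.

```python
def covers(rule, example):
    """A rule is a dict {attribute: required value}; the empty dict means 'if true'."""
    return all(example.get(a) == v for a, v in rule.items())


def find_consistent_rule(seed, positives, negatives):
    """Conquer stage: specialize 'if true' until no negative example is covered."""
    rule = {}
    covered_neg = list(negatives)
    while covered_neg:
        candidates = [a for a in seed if a not in rule]
        if not candidates:
            raise ValueError("the seed cannot be separated from the negative examples")

        def score(a):
            trial = {**rule, a: seed[a]}
            rejected = sum(not covers(trial, n) for n in covered_neg)
            kept = sum(covers(trial, p) for p in positives)
            return (rejected, kept)  # reject negatives first, then keep positives

        best = max(candidates, key=score)
        rule[best] = seed[best]
        covered_neg = [n for n in covered_neg if covers(rule, n)]
    return rule


def separate_and_conquer(examples, cls):
    """Separate stage: add consistent rules until all positives are covered."""
    positives = [x for x, c in examples if c == cls]
    negatives = [x for x, c in examples if c != cls]
    hypothesis = []
    while positives:
        seed = positives[0]  # deterministic stand-in for the random seed choice
        rule = find_consistent_rule(seed, positives, negatives)
        hypothesis.append(rule)
        positives = [p for p in positives if not covers(rule, p)]
    return hypothesis


# Hypothetical toy data with two nominal attributes.
examples = [
    ({"color": "red", "shape": "round"}, "A"),
    ({"color": "red", "shape": "square"}, "A"),
    ({"color": "blue", "shape": "round"}, "B"),
    ({"color": "green", "shape": "square"}, "B"),
]
rules_for_A = separate_and_conquer(examples, "A")
```

Since every rule is built from the values of its seed, each pass covers at least the seed, so the outer loop terminates.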

It can be proven that the number of rules generated and tested by the procedure
separate_and_conquer is polynomial in the number of training examples, in the
width m of the beam search, and in the number of attributes used to describe a
training example. The consistent rules are generated by progressively adding
conditions to the body of the rules, that is, by progressively specializing rules
until they become consistent. More precisely, INDUBI/CSL starts with the rule
having an empty body:

if true then class = Ci

which is complete but inconsistent, since it classifies all training examples in class
Ci. Then the system adds some conditions that make the rule more specific.

The choice of conditions to be added is affected by conditions already selected in
previous steps. This is an elegant way of handling interactions between attributes
that can become complicated in traditional discrimination methods.

When the condition to be added is of the form "attribute ∈ interval" the problem
becomes that of determining the best interval for the continuous-valued attribute
involved in the condition. This is a problem of discretization of
continuous-valued data. In symbolic classification learning, several methods have been
proposed for the discretization of continuous-valued data, most of which
operate off-line, that is they discretize before starting the learning process. On-line
methods for discretizing while learning have been proposed for classification tree
learning algorithms (Fayyad & Irani, 1992), which adopt a different search
strategy named divide-and-conquer.

The discretization procedure implemented in our symbolic learning system
operates on-line and has been designed having three distinct goals in mind:

1. On-line discretization of continuous-valued data should be performed by a
specialization operator, since our symbolic learning algorithm performs a
general-to-specific search in the conquer stage. The discretization procedure
should help to specialize a clause by adding conditions of the type

attribute ∈ [a..b]

where [a..b] denotes a closed interval.

2. The discretization procedure should always guarantee to cover the seed
example that guides the induction process.

3. The heuristic function used to choose among different discretizations
should satisfy some property that reduces the computational complexity of the
discretization procedure.

Details of the discretization process are illustrated in the next Section.



3. The discretization procedure 

The discretization procedure starts with the construction of a table containing a
pair (Value, Class) for each example, where Value is the value taken by the
attribute in the example while Class can be either + or - according to whether
the corresponding example is of class Ci or not (we are supposing that the
rule currently learned concerns class Ci). The table is then sorted in ascending
order on the Value field.

Now the problem is that of finding the interval that best discriminates positive
from negative entries in the table. Any threshold value α lying between two
consecutive distinct values defines two disjoint intervals: the left interval [l1, l2]
and the right interval [r1, r2]. The lower bound l1 of the left interval is the
smallest value in the table with sign +, while the upper bound l2 is the largest
value in the table that does not exceed the threshold α. On the contrary, the
lower bound r1 of the right interval is the smallest value in the table that exceeds
α, while the upper bound r2 is the largest value with sign +. When one of the
two intervals contains no positive entry, it is set to undefined. However, at
least one of the two intervals must be defined, since the table contains at least a
+ in correspondence of the value taken by the attribute for the seed example.
Not all defined intervals are to be considered, but only those defined intervals
that include the value taken by the attribute for the seed example. Such intervals
are said to be admissible, because they guarantee that the corresponding specialized
rule still covers e*.
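
The construction of the left and right intervals for a given threshold can be sketched as below. The table is represented as a list of (value, sign) pairs with sign '+' or '-', an undefined interval is returned as None, and the table is assumed to contain at least one positive entry (the seed); these representation choices and the function names are assumptions made for illustration.

```python
def candidate_intervals(table, alpha):
    """Return (left, right) for threshold alpha; each is (low, high) or None."""
    positives = [v for v, s in table if s == "+"]
    below = [v for v, _ in table if v <= alpha]
    above = [v for v, _ in table if v > alpha]

    def defined(low, high):
        # An interval containing no positive entry is set to undefined.
        return (low, high) if any(low <= v <= high for v in positives) else None

    left = defined(min(positives), max(below)) if below else None
    right = defined(min(above), max(positives)) if above else None
    return left, right


def admissible_intervals(table, alpha, seed_value):
    """Keep only the defined intervals that contain the seed's value."""
    return [iv for iv in candidate_intervals(table, alpha)
            if iv is not None and iv[0] <= seed_value <= iv[1]]


# Hypothetical sorted table of (Value, Class) pairs.
table = [(1, "-"), (2, "+"), (3, "+"), (5, "-"), (7, "+")]
left, right = candidate_intervals(table, 4)
```

With the threshold 4 the left interval is [2, 3] and the right interval is [5, 7]; if the seed takes the value 3, only the left interval is admissible.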

The best admissible interval is selected according to an information-theoretic
heuristic, the information gain. By looking at the table as a source of messages
labeled + and -, the expected information on the class membership conveyed
by a randomly selected message is:

$$info(n^+, n^-) = -\frac{n^+}{n^+ + n^-} \log_2 \frac{n^+}{n^+ + n^-} - \frac{n^-}{n^+ + n^-} \log_2 \frac{n^-}{n^+ + n^-}$$

where $n^+$ and $n^-$ are the number of entries in the table with positive and negative
sign, respectively. If we partition the table into two subsets, S1 and S2, the
former containing entries falling within an admissible interval, while the latter
containing the remaining entries, the information provided by S1 will be close to
zero when almost all cases have the same sign, + or -. Although the information
prefers partitions that cover a large number of entries with the same sign, we
must bias such a preference towards intervals with a high number of positive
entries, as well. The following weighted entropy:

$$E(n_1^+, n_1^-) = \frac{n^+}{n_1^+}\, info(n_1^+, n_1^-)$$

where $n_1^+$ and $n_1^-$ are the numbers of positive and negative entries in S1,
penalizes those admissible intervals with a low percentage of positive entries.
The quantity

$$gain(n^+, n^-, n_1^+, n_1^-) = info(n^+, n^-) - E(n_1^+, n_1^-)$$

measures the information gained by replacing the table with S1. A heuristic
criterion to choose the admissible interval is that of maximizing the information
gain, that is minimizing $E(n_1^+, n_1^-)$.
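
The heuristic can be sketched numerically as follows. The entropy function follows the definition of info directly; the weighting factor is an assumption, namely that the weighted entropy multiplies the information of S1 by the ratio between the number of positive entries in the whole table and the number of positive entries in S1. Function and parameter names are hypothetical.

```python
import math


def info(n_pos, n_neg):
    """Expected information (in bits) of a table with n_pos '+' and n_neg '-' entries."""
    total = n_pos + n_neg
    h = 0.0
    for n in (n_pos, n_neg):
        if n > 0:  # the term 0 * log2(0) is taken as 0
            h -= (n / total) * math.log2(n / total)
    return h


def weighted_entropy(n_pos, n1_pos, n1_neg):
    """Assumed weighting: (n+ / n1+) * info(n1+, n1-); n1_pos >= 1 for admissible intervals."""
    return (n_pos / n1_pos) * info(n1_pos, n1_neg)


def gain(n_pos, n_neg, n1_pos, n1_neg):
    """Information gained by replacing the whole table with the subset S1."""
    return info(n_pos, n_neg) - weighted_entropy(n_pos, n1_pos, n1_neg)
```

For a table with 3 positive and 2 negative entries, an admissible interval covering 2 positive and no negative entries has zero weighted entropy, so its gain equals info(3, 2), about 0.971 bits.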

This criterion differs from that adopted in other well known learning systems,
such as ID3 (Quinlan, 1986) and CN2 (Clark & Niblett, 1989), which do not
weight the entropy. This difference is essentially due to the diverse search
strategies (separate-and-conquer vs. divide-and-conquer) adopted by the
systems.

Note that not all cut points have to be considered. Indeed, only those between
two consecutive distinct entries with different sign (boundary points) are
examined. This choice is due to the following

Theorem. If a cut-point α minimizes the measure $E(n_1^+, n_1^-)$, then α is a
boundary point.

This result avoids several computations of the gain, since only
boundary points need to be considered, thus improving the efficiency of the
discretization procedure.
Actually, the theorem above is similar to that proved by Fayyad and Irani (1992)
for a different measure, namely the "unweighted" class information entropy
computed in some decision tree learning systems. The proof of our theorem can
be obtained electronically from http://lacam.uniba.it:8000/pagine/proof.html.
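
The selection of boundary points can be sketched as follows. Two details are assumptions made for illustration: the cut point is placed at the midpoint of two consecutive distinct values, and a pair of values is skipped only when all entries at both values share one and the same sign.

```python
from itertools import groupby


def boundary_points(table):
    """Midpoints between consecutive distinct values whose entries differ in sign.

    table: list of (value, sign) pairs with sign '+' or '-'.
    """
    # Group the entries by value and collect the set of signs seen at each value.
    groups = [(value, {sign for _, sign in entries})
              for value, entries in groupby(sorted(table), key=lambda e: e[0])]
    cuts = []
    for (v1, s1), (v2, s2) in zip(groups, groups[1:]):
        # Not a boundary only when both values carry one and the same sign.
        if not (len(s1) == 1 and s1 == s2):
            cuts.append((v1 + v2) / 2)
    return cuts


# Hypothetical table of (Value, Class) pairs.
table = [(1, "-"), (2, "+"), (3, "+"), (5, "-"), (7, "+")]
cuts = boundary_points(table)
```

For this table the cut between the values 2 and 3 is skipped, because both entries are positive, and three of the four candidate thresholds remain.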



4. Comments on experimental results 

The proposed discretization procedure has been tested on three data sets taken
from the UCI repository of machine learning databases
(http://www.ics.uci.edu/~mlearn), namely Iris, Glass and Hepatitis. The performance
of INDUBI/CSL is compared to that of C4.5rules (Quinlan, 1993), another
symbolic classification learning system. The design of the experiment is based on
10-fold cross-validation. For each of the ten trials we collect three statistics,
namely, the number of omission and commission errors, and the number of rules
generated. Note that while INDUBI/CSL can distinguish between commission
and omission errors, C4.5rules always classifies an instance in some class, thus a
misclassification error automatically implies both a commission and an omission
error. Statistics are summarized in Table 1. The predictive accuracy of
INDUBI/CSL is comparable with that of C4.5rules: the detected difference is
not statistically significant according to the non-parametric Wilcoxon
signed-ranks test (Orkin & Drogin, 1990). As to the number of clauses, our system
always produces a higher number of rules than C4.5rules does. This can be
attributed to the fact that INDUBI/CSL outputs sets of rules that are complete
and consistent, while Quinlan's system produces rules that are not necessarily
correct with respect to the set of training examples.



Table 1: Experimental results

             Average no. of errors      p-value Wilcoxon     Number of rules
Dataset      INDUBI/CSL   C4.5rules     signed-rank test     INDUBI/CSL   C4.5rules
Iris         2.0          1.2           0.1282               9.4          4.0
Glass        14.4         13.8          0.7670               54.1         13.6
Hepatitis    8.2          6.8           0.3329               15.1         5.7



INDUBI/CSL has also been applied to a domain involving structured objects,
namely document understanding, where traditional statistical techniques are not
straightforwardly applicable because of the difficulty of dealing with relations
between subparts of an object. According to the ODA/ODIF standard (Horak,
1985), any document is characterized by two different structures representing
both its internal organization and its content: the layout (or geometrical)
structure and the logical structure. The former associates the content of a
document with a hierarchy of layout objects, such as text lines,
vertical/horizontal lines, graphic/photographic elements, pages, and so on. The
latter associates the content of a document with a hierarchy of logical objects,
such as sender/receiver of a business letter, title/authors of an article, and so on
(see Figure 2). The term document analysis denotes the extraction of the layout
structure from the bitmap of a document, while the term document
understanding denotes the process of mapping the layout structure of a
document into the corresponding logical structure (Tang et al., 1994). The
document understanding process is based on the assumption that documents can
be understood by means of their layout structures alone.

The mapping of the layout structure into the logical structure can be represented
as a set of rules. Traditionally, such rules have been handcoded for particular
kinds of documents (Nagy et al., 1992), requiring much human effort. We
proposed the application of symbolic classification learning techniques in order
to automatically generate the rules from a set of training documents whose
layout components have been labeled, where possible, according to their logical meaning
(Esposito et al., 1994). Each training document generates as many
training examples as it has layout components. Classes of training
examples correspond to the distinct logical components to be recognized in a
document. The unlabelled layout objects play the role of counterexamples for all
the classes to be learned.

The description of each training example includes four numeric attributes
(height, width, vertical_position and horizontal_position of a block), one
symbolic attribute (type of block) and four binary relations with other
components of the layout structure (part_of, ontop, to_right and alignment)
(Malerba et al., 1997b).
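One way to picture such a description is as a small record per layout block plus a set of relation tuples. The sketch below is only an illustration: the class names `LayoutBlock` and `Document`, the helper `examples_for`, the numeric values, and the choice to treat blocks of other classes as counterexamples are all assumptions, not the representation actually used by INDUBI/CSL.

```python
from dataclasses import dataclass, field
from typing import Optional

RELATIONS = {"part_of", "ontop", "to_right", "alignment"}


@dataclass
class LayoutBlock:
    block_id: str
    height: float
    width: float
    vertical_position: float
    horizontal_position: float
    block_type: str                 # symbolic attribute, e.g. "text" or "graphic"
    label: Optional[str] = None     # logical class, None when left unlabelled


@dataclass
class Document:
    blocks: list = field(default_factory=list)
    relations: set = field(default_factory=set)   # (relation, block_id, block_id)

    def relate(self, relation, first, second):
        if relation not in RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.relations.add((relation, first, second))


def examples_for(documents, logical_class):
    """Split all layout blocks into examples and counterexamples of one class."""
    positives, negatives = [], []
    for doc in documents:
        for block in doc.blocks:
            target = positives if block.label == logical_class else negatives
            target.append(block)
    return positives, negatives


letter = Document(blocks=[
    LayoutBlock("b1", 40, 120, 10, 15, "graphic", "logotype"),
    LayoutBlock("b2", 12, 200, 60, 15, "text", "sender"),
    LayoutBlock("b3", 8, 300, 700, 15, "text"),   # unlabelled block
])
letter.relate("ontop", "b1", "b2")

pos, neg = examples_for([letter], "sender")
print(len(pos), len(neg))
```

Each block yields one training example, and the unlabelled block `b3` ends up among the counterexamples of every class.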

In order to compare the performance of our learning system with that of
FOIL6.2 (Quinlan & Cameron-Jones, 1993), another symbolic classification
learning system that is able to deal with both numeric and symbolic attributes
and relations, we organized an experiment as follows. We
considered a set of 30 business letters containing 364 layout components. The
number of attributes and relations used to describe each document is variable,
ranging from fifty to more than one hundred. With each layout component we
associated at most one of the following labels: logotype, sender, receiver,
date, reference number, body of the letter, signature. Those layout components
whose content is not significant have been left unlabelled. The experimental
design is based on 10-fold cross-validation. Once again, for each of the
ten trials we collected three statistics, namely the number of omission errors, the
number of commission errors, and the number of rules generated. Experimental results are
summarized in Table 2.

Briefly, we can observe that the average error rate, as well as the average
number of rules, is almost the same for both systems. The difference is in the
type of errors: Foil6.2 discretizes into larger intervals than INDUBI/CSL, thus
causing more commission errors. Although commission and omission errors are
generally considered equally important, it is worthwhile to observe that in
automatic document processing systems omission errors are deemed to be less
serious than commission errors, which can lead to unrecoverable errors in
storing information unless there is heavy human intervention. Moreover, we have also
shown that a significant recovery of omission errors can be obtained by relaxing
the definition of matching (Esposito et al., 1997).



Table 2: Experimental results

               Average no. of omission/     Average no.     Number
System         commission errors            of errors       of rules
INDUBI/CSL          2.6/0.3                    2.9            11.5
Foil6.2             1.3/1.5                    2.8            11.0
Abstract: We consider the problem of parameter estimation in logistic
discrimination. Our approach exploits the minimization of an error function
based on distance measures between posterior probability distributions of the
classes. In this context we analyze statistical properties of the Kullback-Leibler
directed divergence and of the Euclidean distance from both theoretical and applied
points of view.

Keywords: Logistic discrimination, separability measures, parameter estimation. 



1. Logistic discrimination 

Classification is a wide area of statistical problems and methods. An
interesting survey is given in Mineo (1986), who distinguishes three kinds of
classification problems: cluster analysis, discrimination and mixture decomposition.
Here we focus on discrimination, that is on the process of deriving classification
rules from samples of classified objects, while the term classification refers to
applying the rules to new objects of an unknown class.

Let $\Omega_1$ and $\Omega_2$ be two populations of objects and, for $\omega \in \Omega_1 \cup \Omega_2$, let $x = x(\omega)$
be the $m$-dimensional feature vector of $\omega$ (according to some suitable criterion).
The classical approach to linear discrimination between $\Omega_1$ and $\Omega_2$ is based on
some linear function $g(x, w) = a^t x + b$ such that $\omega$ is classified as coming from
$\Omega_1$ when $a^t x + b > 0$ and from $\Omega_2$ when $a^t x + b < 0$, where $w = (a, b) \in R^{m+1}$
are parameters called weights and the superscript $t$ denotes vector transpose; more
generally the function $g$ could be linear in some function $h : R^m \to R^p$, that is
$g(x, w) = a^t h(x) + b$.

Linear discrimination can also be approached by considering the logistic
sigmoid transformation $\psi(z) = (1 + e^{-z})^{-1}$, which acts on the function $a^t x + b$.
In this case we obtain the discriminant function:

$$y(x, w) = \psi(a^t x + b). \tag{1}$$

Linear discriminant functions having the form (1) are known in the statistical
literature as logistic discriminant functions; in the neural networks domain they are
realized by simple perceptrons, see e.g. Bishop (1995). The form (1) is still regarded
as a linear discriminant since the decision boundary which it generates is linear as
a consequence of the monotonic nature of $\psi(z)$. The fundamental assumption of
this approach is that the logarithm of the ratio of the group conditional densities
of $\Omega_1$ and $\Omega_2$, say $f_1$ and $f_2$, is linear:

$$\ln\big(f_1(x)/f_2(x)\big) = a^t x + b, \tag{2}$$

see e.g. Anderson (1982) and Chapter 8 in McLachlan (1992) for details. The
parameters $(a, b)$ in equation (1) are generally estimated according to the maximum
likelihood method. The main problem of this approach is that the maximum likelihood
estimates may fail to exist: for example, if $\Omega_1$ and $\Omega_2$ are linearly separable then the
likelihood function does not have a unique maximum attained for finite $a \in R^m$.
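The difficulty with linearly separable samples can be seen numerically. The following sketch uses a hypothetical one-dimensional sample (four points, two per class) that is separated by the origin; scaling the weight $a$ along the separating direction keeps increasing the log-likelihood of model (1), so no finite maximizer exists. The function name and the data are illustrative assumptions.

```python
import math


def log_likelihood(a, b, xs, labels):
    """Log-likelihood of model (1) for scalar features; label 1 means class Omega_1."""
    total = 0.0
    for x, xi in zip(xs, labels):
        z = a * x + b
        # log psi(z) and log(1 - psi(z)), written in a numerically stable form
        if z >= 0:
            log_y = -math.log1p(math.exp(-z))
            log_1my = -z - math.log1p(math.exp(-z))
        else:
            log_y = z - math.log1p(math.exp(z))
            log_1my = -math.log1p(math.exp(z))
        total += xi * log_y + (1 - xi) * log_1my
    return total


# Hypothetical, linearly separable sample: negative x in one class, positive x in the other.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

values = [log_likelihood(c, 0.0, xs, labels) for c in (1.0, 10.0, 100.0)]
print(values)
```

The printed values are all negative and increase towards zero as the scale factor grows, which is the behaviour the text refers to.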



2. Parameter estimation by probability distance measures 

Another approach to the estimation of the parameters $(a, b)$ in (1) can be
constructed considering that, under the assumption (2), the values of $y(x)$ given by
(1) can be interpreted as probabilities. Let $\mu(\Omega_1 \mid x)$ denote the a posteriori
probability of class $\Omega_1$; then:

$$\mu(\Omega_1 \mid x) = y(x), \qquad \mu(\Omega_2 \mid x) = 1 - y(x). \tag{3}$$

Moreover we can introduce a target probability distribution $\nu$ on $\Omega_1 \cup \Omega_2$ defined
as

$$\nu(\Omega_1 \mid x) = 1_{\Omega_1}(\omega), \qquad \nu(\Omega_2 \mid x) = 1 - 1_{\Omega_1}(\omega), \tag{4}$$

where $1_{\Omega_1}(\omega)$ is the indicator function of $\Omega_1$. Let $\omega_1, \ldots, \omega_N$ be $N$ given
objects from $\Omega_1 \cup \Omega_2$; in particular $\omega_1, \ldots, \omega_{n_1}$ come from $\Omega_1$ and $\omega_{n_1+1}, \ldots, \omega_N$
come from $\Omega_2$ (separate sampling). For each $\omega_n \in \Omega_1 \cup \Omega_2$ we set
$\xi_n = \nu(\Omega_1 \mid x_n)$; the value $\xi_n$ represents the identification value of $\omega_n$. The set

$$\{(x_n, \xi_n)\}_{n=1,\ldots,N}$$

is called design set or training set.

In this case the parameters $w = (a, b)$ in (1) are selected by minimizing the
following quantity, called error function:

$$\mathcal{E}_N(w) = \sum_{n=1}^{N} \phi(y_n, \xi_n) \tag{5}$$

where $y_n = y(x_n, w) = \psi(a^t x_n + b)$ and $\phi(y, \xi)$ is a function which satisfies the
following conditions:

1. $\phi(y, \xi) \geq 0$ for $(y, \xi) \in [0, 1] \times \{0, 1\}$,

2. $\phi(y, \xi)$ is $C^2$ with respect to $y$ for each $\xi \in \{0, 1\}$,

3. $\phi(y, \xi) = 0$ if and only if $y = \xi$,

4. $(2\xi - 1)\, \partial\phi / \partial y < 0$ for all $y \in (0, 1)$.

We notice that assumption 4 means that if $\xi = 0$ then $\phi$ is increasing for $y \in (0, 1)$
and if $\xi = 1$ then $\phi$ is decreasing for $y \in (0, 1)$. In the spirit of Vapnik (1982),
the function $\phi$ is here called loss function. Moreover we point out that, by (3) and
(4), the loss function can be considered as a distance measure between probability
distributions.

In practical problems, a question of interest concerns the convexity of the error
function (5). In fact, when the error function $\mathcal{E}_N(w)$ is convex, the
absolute minimum can be attained using simple steepest descent algorithms; on
the contrary, when it is not convex, more complicated minimization
strategies should be adopted.

Theorem 1 given below provides a necessary and sufficient condition for the error
function $\mathcal{E}_N(w)$ given in (5) to be convex; it generalizes analogous results given in
Devouge (1992). The result is based on the following condition for a real valued
function defined on Euclidean spaces to be convex (see e.g. the Appendix to Chapter
3 in Cecconi and Stampacchia (1983) for details).

Proposition 1 Let $B$ be an open convex set of $R^p$ and let $f : B \to R$ be $C^2$. Then
$f$ is strictly convex if and only if its Hessian matrix $\mathcal{H}(f)$ is positive definite,
namely if $v^t \mathcal{H}(f) v > 0$ for each $v \in R^p$, $v \neq 0$.

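Some preliminary notation concerns matrix derivatives, see e.g.
Chapter 10 in Lütkepohl (1996). The gradient vector of $\mathcal{E}_N(w)$ with
respect to the weights $w \in R^k$, for some integer $k$, is the vector of the first
order partial derivatives of $\mathcal{E}_N$ given by:

$$\frac{\partial \mathcal{E}_N(w)}{\partial w^t} = \left( \frac{\partial \mathcal{E}_N(w)}{\partial w_1}, \ldots, \frac{\partial \mathcal{E}_N(w)}{\partial w_k} \right) \tag{6}$$

and the Hessian matrix of the second order partial derivatives of $\mathcal{E}_N(w)$ is the
$k \times k$ matrix given by:

$$\mathcal{H}(\mathcal{E}_N) = \frac{\partial^2 \mathcal{E}_N}{\partial w\, \partial w^t} = \left( \frac{\partial^2 \mathcal{E}_N}{\partial w_i\, \partial w_j} \right). \tag{7}$$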

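Theorem 1 The error function (5) is a convex function of the weights for every
learning set $\{(x_n, \xi_n)\}$ of size $N \geq 1$, if and only if the loss function $\phi$ satisfies
the condition:

$$y(1 - y)\, \frac{\partial^2 \phi}{\partial y^2} + (1 - 2y)\, \frac{\partial \phi}{\partial y} > 0$$

for each $(y, \xi) \in (0, 1) \times \{0, 1\}$.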
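Proof. Without loss of generality, we can consider the logistic discriminant function
$y(x) = \psi(a^t x)$; in fact if we set $\tilde{x} = (x^t, 1)^t$ and $\tilde{a} = (a^t, b)^t$, then we have
$a^t x + b = \tilde{a}^t \tilde{x}$. By Proposition 1, the function $\mathcal{E}_N(a)$ is convex if and only if
$v^t \mathcal{H}(\mathcal{E}_N) v > 0$ for each $v \in R^m$, $v \neq 0$. Then we have to compute the
Hessian matrix of $\mathcal{E}_N$ with respect to $a$.

Preliminarily let us compute the gradient vector of $\mathcal{E}_N$. Let us denote $z_n = a^t x_n$
(and thus $y_n = \psi(z_n)$); then (6) yields:

$$\frac{\partial \mathcal{E}_N(a)}{\partial a^t} = \sum_{n=1}^{N} \frac{\partial \phi(y_n, \xi_n)}{\partial a^t}. \tag{8}$$

Applying the chain rule of the gradient of real functions with vector arguments
(see e.g. Lütkepohl, 1996) to each term of (8), we obtain:

$$\frac{\partial \mathcal{E}_N(a)}{\partial a^t} = \sum_{n=1}^{N} \frac{\partial \phi(y_n, \xi_n)}{\partial y_n} \frac{\partial y_n}{\partial z_n} \frac{\partial z_n}{\partial a^t} = \sum_{n=1}^{N} \frac{\partial \phi(y_n, \xi_n)}{\partial y_n}\, \psi'(z_n)\, x_n^t.$$

Afterwards we can compute the Hessian matrix $\mathcal{H}(\mathcal{E}_N)$ of $\mathcal{E}_N$ from (7). For
simplicity, in the rest of the proof we set $\phi_n = \phi(y_n, \xi_n)$. Thus:

$$\mathcal{H}(\mathcal{E}_N) = \frac{\partial^2 \mathcal{E}_N}{\partial a\, \partial a^t} = \sum_{n=1}^{N} \frac{\partial}{\partial a} \left[ \frac{\partial \phi_n}{\partial y_n}\, \psi'(z_n) \right] x_n^t = \sum_{n=1}^{N} \left[ \frac{\partial}{\partial a}\!\left( \frac{\partial \phi_n}{\partial y_n} \right) \psi'(z_n) + \frac{\partial \phi_n}{\partial y_n} \frac{\partial \psi'(z_n)}{\partial a} \right] x_n^t.$$

Applying again the chain rule of the gradient of real functions with vector
arguments, we get:

$$\mathcal{H}(\mathcal{E}_N) = \sum_{n=1}^{N} \left[ \frac{\partial^2 \phi_n}{\partial y_n^2}\, \psi'(z_n)^2\, x_n + \frac{\partial \phi_n}{\partial y_n}\, \psi''(z_n)\, x_n \right] x_n^t = \sum_{n=1}^{N} \left[ \frac{\partial^2 \phi_n}{\partial y_n^2}\, \psi'(z_n) + \frac{\partial \phi_n}{\partial y_n} \frac{\psi''(z_n)}{\psi'(z_n)} \right] \psi'(z_n)\, x_n x_n^t.$$

Moreover, as $\psi(z) = (1 + e^{-z})^{-1}$, we have $\psi'(z) = \psi(z)(1 - \psi(z))$ and
$\psi''(z) = \psi'(z)(1 - 2\psi(z))$. Then, as we set $y_n = \psi(z_n)$, it follows:

$$\mathcal{H}(\mathcal{E}_N) = \sum_{n=1}^{N} \left[ \frac{\partial^2 \phi_n}{\partial y_n^2}\, y_n(1 - y_n) + \frac{\partial \phi_n}{\partial y_n}\, (1 - 2y_n) \right] y_n(1 - y_n)\, x_n x_n^t.$$

Now, for any $v \in R^m$ let us consider the quantity:

$$v^t \mathcal{H}(\mathcal{E}_N) v = \sum_{n=1}^{N} \left[ \frac{\partial^2 \phi_n}{\partial y_n^2}\, y_n(1 - y_n) + \frac{\partial \phi_n}{\partial y_n}\, (1 - 2y_n) \right] y_n(1 - y_n)\, v^t x_n x_n^t v$$

$$\phantom{v^t \mathcal{H}(\mathcal{E}_N) v} = \sum_{n=1}^{N} \left[ \frac{\partial^2 \phi_n}{\partial y_n^2}\, y_n(1 - y_n) + \frac{\partial \phi_n}{\partial y_n}\, (1 - 2y_n) \right] y_n(1 - y_n)\, (v^t x_n)^2.$$

By Proposition 1, $\mathcal{E}_N$ is convex for every learning set $\{(x_n, \xi_n)\}_{n=1,\ldots,N}$ of
size $N \geq 1$ if and only if for each $v \in R^m$, $v \neq 0$, we have $v^t \mathcal{H}(\mathcal{E}_N) v > 0$. In
this last expression the quantity $y_n(1 - y_n)$ is strictly positive; hence $v^t \mathcal{H}(\mathcal{E}_N) v$ is
positive if and only if:

$$y(1 - y)\, \frac{\partial^2 \phi}{\partial y^2} + (1 - 2y)\, \frac{\partial \phi}{\partial y} > 0$$

for each $(y, \xi) \in (0, 1) \times \{0, 1\}$. This completes the proof.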



Usually one considers the classical squared distance $\phi_2(y, \xi) = (y - \xi)^2$. In
recent years, some authors have proposed distance measures based on the
Kullback-Leibler directed divergence, say $\phi_{KL}(y, \xi)$. In the next section we compare the
properties of the error function $\mathcal{E}_N(w)$ based respectively on the distances $\phi_2$ and
$\phi_{KL}$ from both theoretical and applied points of view.



3. Error functions based on Kullback-Leibler type distances 

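The Kullback-Leibler directed divergence between two probability distributions
$P = \{p_1, \ldots, p_n\}$ and $Q = \{q_1, \ldots, q_n\}$ on a finite set of $n$ points is defined as:

$$\delta_{KL}(P, Q) = -\sum_{k=1}^{n} p_k \ln \frac{q_k}{p_k}. \tag{9}$$

Properties of the distance measures of Kullback-Leibler type can be found in
Kannappan and Sahoo (1992). In our case, expression (9) gives:

$$\delta_{KL}(\nu, \mu) = -\nu(\Omega_1 \mid x) \ln \frac{\mu(\Omega_1 \mid x)}{\nu(\Omega_1 \mid x)} - \nu(\Omega_2 \mid x) \ln \frac{\mu(\Omega_2 \mid x)}{\nu(\Omega_2 \mid x)}$$

$$\phantom{\delta_{KL}(\nu, \mu)} = -\nu(\Omega_1 \mid x) \ln \mu(\Omega_1 \mid x) + \nu(\Omega_1 \mid x) \ln \nu(\Omega_1 \mid x) - \nu(\Omega_2 \mid x) \ln \mu(\Omega_2 \mid x) + \nu(\Omega_2 \mid x) \ln \nu(\Omega_2 \mid x).$$

We have that $\nu(\Omega_i \mid x) = 1$ for $x \in \Omega_i$, with $i = 1, 2$; moreover we set $0 \ln 0 = 0$,
as $u \ln u \to 0$ for $u \to 0^+$. Then, by (4) and (3), it follows:

$$\delta_{KL}(\nu, \mu) = -\nu(\Omega_1 \mid x) \ln \mu(\Omega_1 \mid x) - \nu(\Omega_2 \mid x) \ln \mu(\Omega_2 \mid x) = -\xi \ln y - (1 - \xi) \ln(1 - y).$$

Given that $\xi \in \{0, 1\}$, we can rewrite this expression as:

$$\delta_{KL}(\nu, \mu) = -\ln(1 + 2y\xi - \xi - y).$$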
Figure 1: Plot of the functions $4y - 6y^2$ ($\xi = 0$, dotted line) and $8y - 6y^2 - 2$
($\xi = 1$, solid line)



We can introduce a loss function $\phi_{KL}(y, \xi)$ equal to the Kullback-Leibler
directed divergence between $\nu$ and $\mu$, that is $\phi_{KL}(y, \xi) = \delta_{KL}(\nu, \mu) =
-\ln(1 + 2y\xi - \xi - y)$. By an application of Theorem 1, we find the following
result.



Proposition 2 The error function $\mathcal{E}_N(w)$ is in general not convex when the
distance $\phi_2(y, \xi)$ is adopted, whereas it is always convex when the distance
$\phi_{KL}(y, \xi)$ is considered.



Proof. The proof follows directly from Theorem 1. When $\phi_2 = (y - \xi)^2$, then
$\partial\phi_2/\partial y = 2(y - \xi)$ and $\partial^2\phi_2/\partial y^2 = 2$. In this case:

$$y(1 - y)\, \frac{\partial^2 \phi_2}{\partial y^2} + (1 - 2y)\, \frac{\partial \phi_2}{\partial y} = \begin{cases} -6y^2 + 4y & \text{if } \xi = 0 \\ -6y^2 + 8y - 2 & \text{if } \xi = 1 \end{cases}$$

These two functions are plotted in Figure 1; in particular when $\xi = 0$ we have
$-6y^2 + 4y < 0$ for $y > 2/3$ and when $\xi = 1$ we have $-6y^2 + 8y - 2 < 0$ for
$y < 1/3$. Hence the condition of Theorem 1 is not satisfied. In the other case,
$\phi_{KL}(y, \xi) = -\ln(1 + 2y\xi - \xi - y)$, after some algebra we obtain:

$$y(1 - y)\, \frac{\partial^2 \phi_{KL}}{\partial y^2} + (1 - 2y)\, \frac{\partial \phi_{KL}}{\partial y} = 1.$$

In this case the condition of Theorem 1 is satisfied for all $(y, \xi) \in (0, 1) \times \{0, 1\}$.
This completes the proof.
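The two cases of the proof can be checked numerically. The sketch below evaluates the left-hand side of the condition of Theorem 1, $y(1-y)\,\partial^2\phi/\partial y^2 + (1-2y)\,\partial\phi/\partial y$, on a grid of values of $y$ for both loss functions; the function names are illustrative, and the derivatives of $\phi_{KL}$ are taken from its equivalent form $-\xi \ln y - (1-\xi)\ln(1-y)$.

```python
def condition(d1, d2, y):
    """Left-hand side of the convexity condition of Theorem 1."""
    return y * (1.0 - y) * d2 + (1.0 - 2.0 * y) * d1


def squared_derivatives(y, xi):
    """First and second derivative of phi_2 = (y - xi)^2 with respect to y."""
    return 2.0 * (y - xi), 2.0


def kl_derivatives(y, xi):
    """First and second derivative of phi_KL = -xi ln y - (1 - xi) ln(1 - y)."""
    d1 = -xi / y + (1.0 - xi) / (1.0 - y)
    d2 = xi / y ** 2 + (1.0 - xi) / (1.0 - y) ** 2
    return d1, d2


grid = [k / 100.0 for k in range(1, 100)]

kl_values = [condition(*kl_derivatives(y, xi), y) for xi in (0, 1) for y in grid]
sq_xi0 = condition(*squared_derivatives(0.8, 0), 0.8)   # equals 4y - 6y^2 at y = 0.8
sq_xi1 = condition(*squared_derivatives(0.2, 1), 0.2)   # equals 8y - 6y^2 - 2 at y = 0.2

print(min(kl_values), max(kl_values))
print(sq_xi0, sq_xi1)
```

For $\phi_{KL}$ the expression stays at 1 over the whole grid, while for $\phi_2$ it is negative at $y = 0.8$ with $\xi = 0$ and at $y = 0.2$ with $\xi = 1$, in agreement with the regions $y > 2/3$ and $y < 1/3$ identified in the proof.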



4. Numerical studies 

From a theoretical point of view, Proposition 2 implies that the use of the distance
measure $\phi_{KL}$ is preferable to the Euclidean distance $\phi_2$ in order to
estimate the weights $w = (a, b)$ in (1). We have also investigated their properties
from a practical point of view using some sets of real data.

Figure 2: a) Values of urinary androsterone and etiocholanolone in 11 healthy
heterosexual (o) and 15 healthy homosexual males (+) in mg/24 hours. Decision
surfaces obtained by minimizing the error function based on $\phi_{KL}$ (solid line) and on
$\phi_2$ (dotted line). b) Sum of squared errors vs iteration: $\phi_{KL}$ (solid line) and $\phi_2$
(dotted line).

To begin with, we analyzed the speed of convergence towards the absolute
minimum of the error function (5) based on the loss functions $\phi_2$ and $\phi_{KL}$. Figure
2a shows the values of urinary androsterone and etiocholanolone in healthy
heterosexual and homosexual males in mg/24 hours (data from Margolese (1970),
reprinted in Hand (1981)); here $m = 2$, $n_1 = 11$ and $n_2 = 15$. In both cases the
error function (5) was minimized 100 times by a steepest descent algorithm starting
from a point randomly chosen with uniform distribution on $[-1, 1]^{m+1}$.
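A minimal steepest descent sketch for the error function (5) is given below. It is not the authors' program and does not use the Margolese data: the six one-dimensional points are hypothetical, the starting point is fixed at the origin rather than drawn at random, and the step size is an arbitrary choice. The gradient follows from the chain rule, $\partial\phi/\partial y \cdot \psi'(z)$, which reduces to $y - \xi$ for $\phi_{KL}$ and to $2(y - \xi)\,y(1 - y)$ for $\phi_2$.

```python
import math


def psi(z):
    return 1.0 / (1.0 + math.exp(-z))


def descend(data, loss, steps=200, rate=0.2):
    """Steepest descent on E_N(a, b) for scalar features.

    Returns the final weights and the sum of squared errors at each iteration,
    so that both losses can be compared on the same scale.
    """
    a, b = 0.0, 0.0
    sse_history = []
    for _ in range(steps):
        grad_a = grad_b = sse = 0.0
        for x, xi in data:
            y = psi(a * x + b)
            sse += (y - xi) ** 2
            if loss == "kl":
                common = y - xi                           # d(phi_KL)/dy * psi'(z)
            else:
                common = 2.0 * (y - xi) * y * (1.0 - y)   # d(phi_2)/dy * psi'(z)
            grad_a += common * x
            grad_b += common
        sse_history.append(sse)
        a -= rate * grad_a
        b -= rate * grad_b
    return (a, b), sse_history


# Hypothetical overlapping sample: (feature, identification value xi).
data = [(-2.0, 0), (-1.0, 0), (0.5, 0), (-0.5, 1), (1.0, 1), (2.0, 1)]

w_kl, sse_kl = descend(data, "kl")
w_sq, sse_sq = descend(data, "squared")
print(w_kl, sse_kl[-1])
print(w_sq, sse_sq[-1])
```

Recording the sum of squared errors for both runs mirrors the comparison on a common scale used in Figure 2b; no claim about relative speed is made for this toy sample.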

The numerical experiments showed that, adopting the $\phi_{KL}$ distance, the
convergence to the absolute minimum of the error function is generally obtained in
a smaller number of iterations than in the $\phi_2$ case. A typical output is plotted in
Figure 2; in particular, Figure 2b gives the plots of the sum
of the squared errors vs iteration for both loss functions. In order
to make a congruent comparison between the two loss functions $\phi_2$ and $\phi_{KL}$, we
compared the related sums of squared errors: this explains the “flickering” in the
error curve concerning $\phi_{KL}$.

Another numerical aspect of the minimization of the error function (5)
concerns the importance of standardizing the data when they have different orders
of magnitude. This has been shown by using the data concerning three species
of flea beetle Chaetocnema (data from Lubischew (1962)). The discrimination
is based on the maximal width of the aedeagus in the forepart and the front angle of
the aedeagus. When the original data were taken into account the minimization
of expression (5) was critical; on the contrary, a complete separation was
obtained when the input data were preliminarily standardized.
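The standardization step referred to above can be sketched as follows, assuming the usual transformation to zero mean and unit sample standard deviation for each variable. The function name and the three measurement rows are hypothetical and are not taken from the Lubischew data.

```python
import math


def standardize(rows):
    """Rescale each column to zero mean and unit sample standard deviation."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(m)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]


# Hypothetical measurements whose two columns differ by an order of magnitude.
raw = [[150.0, 15.0], [120.0, 14.0], [135.0, 10.0]]
scaled = standardize(raw)
print(scaled)
```

After this transformation both variables contribute on a comparable scale to $a^t x + b$, so a single step size in a steepest descent algorithm is appropriate for all weights.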
Abstract: In this paper a nonparametric method for discriminant analysis is
proposed, based on a group separation oriented version of projection pursuit 
density estimation. Each population is separated in turn from the remaining 
ones, considered as a whole, by approximating the boundary between them 
through the composition of some informative directions, chosen according to an 
appropriate discrimination criterion. A coherent allocation rule is proposed, too. 
Simulation studies have shown that this method represents a valid solution for 
problems when the parametric approaches are not flexible enough and sample 
sizes are too small to use classical nonparametric methods. 

Keywords: Projection Pursuit, Projection Pursuit Density Estimation, 
Nonparametric Discriminant Analysis. 



1. Introduction 

The principal goal of discriminant analysis is to assign a new object to one of
two or more groups $\Pi_1, \Pi_2, \ldots, \Pi_g$, on the basis of $m$ measured characteristics
$x' = (x_1, x_2, \ldots, x_m)$ associated with the object. An allocation rule is generally
determined by optimizing an objective function, describing group separation,
which is based on character densities in the various groups (McLachlan, 1992).
When nothing is known a priori about such densities, nonparametric density
estimation is called for (Hand, 1982). However, this approach is not free from
serious drawbacks, as classical multivariate nonparametric density estimators,
such as kernel and nearest neighbour, are heavily biased by what is generally
called “the curse of dimensionality”.

To overcome this problem Friedman, Stuetzle and Schroeder (1984) propose
the use of projection pursuit methods for density estimation (ppde). The basic
ideas of projection pursuit can be traced back to 1974, when Friedman and
Tukey formulated the method as the search for the linear projections (whose
generic vector of coefficients will be denoted by $a$) of a multivariate data set which
maximize a user-defined measure of interestingness, which they called the
projection index (denoted by $I(a)$; see Montanari, Guglielmi (1996) and
Montanari, Lizzani (1997) for a thorough discussion). Within the density
estimation context this method suggests approximating an $m$-variate density
$p(x)$ by a density of the form:

$$p^M(x) = p^0(x) \prod_{k=1}^{M} f_k(a_k' x), \tag{1}$$

where $p^0$ is a “null” model (e.g. a normal density with the same mean and
covariance as $p$) and $f_k$ are the so-called “augmenting functions”, with
$a_k \in R^m$ determining univariate projections.

As Huber (1985) suggests, the sequence $\{f_k, a_k\}$ can be viewed either
“synthetically”, as a series of modifications to $p^0$, or “analytically”, as a series
of modifications to $p$ that “strips away” its structure step by step.

The synthetic approach determines $\{f_k, a_k\}$ such that the sequence
$p^k(x) = p^{k-1}(x) f_k(a_k' x)$, expressed as a recursion relation derived from (1),
converges to $p$ as fast as possible: at each step, it tries to attain the largest
improvement of the current model $p^{k-1}$. Symmetrically, the analytic viewpoint
starts with $p^M = p$ and sequentially looks for $\{f_k, a_k\}$ such that
$p^{k-1}(x) = p^k(x) f_k^{-1}(a_k' x)$ converges to $p^0$ using the smallest number of
steps.

When, according to Friedman, Stuetzle and Schroeder, relative entropy (RE) is
used to measure the quality of the approximation, in the former approach the
optimization problem is solved by

$$f_k(a_k' x) = p(a_k' x) / p^{k-1}(a_k' x) \quad \text{and} \quad a_k = \arg\max_a RE(p_a, p_a^{k-1});$$

in other words, since this optimal form of $f_k$ establishes marginal agreement
along $a_k$, the augmenting function is best applied to $p^{k-1}$ along the direction in
which the current model and the objective function differ most. Its analytic
counterpart is

$$f_k(a_k' x) = p^k(a_k' x) / p^0(a_k' x) \quad \text{and} \quad a_k = \arg\max_a RE(p_a^k, p_a^0).$$

Thus, while in the synthetic algorithm the projection index is based on a
comparison between the data (coming from $p$) and the current estimate
(updated at each step of the procedure), the analytic approach only involves a
comparison between the density estimate (whose non-normal features are
cleaned away step by step) and the null model $p^0$. For this reason, which will
be reconsidered later, in what follows reference will be made to the analytic
approach and not to the synthetic one, which has had wider use in the statistical
literature.

As it has been shown above, in Friedman Stuetzle and Schroeder’s approach the 
projection index (i.e. marginal relative entropy) is oriented to reconstruct the 
multivariate density in an “optimal way”, but this is not necessarily the best 
choice when the estimated density has to be used in the context of discriminant 
analysis. 



2. Projection pursuit and discriminant analysis 

The nonparametric method for discriminant analysis we present in this paper is 
based on projection pursuit density estimation, but resorts to a new projection 
index, more closely oriented to group separation. 

The idea of projection pursuit density estimation has already been applied to 
discriminant analysis by Polzehl (1995), who uses the expected overall loss of 
the allocation rule as a projection index. That solution generalizes a proposal 
due to Posse (1992), which was limited to the derivation of a two-group linear 
discriminant function. 

We share Polzehl’s view that in the classification context there is some interest
in choosing the same updating direction for all populations and corresponding
densities, but we start from this consideration to derive a proposal whose
methodological aspects differ profoundly from Polzehl’s.

Instead of estimating g distinct densities for the g groups, our estimation 
approach regards the density mixture. This choice, which also has the further 
effect of giving more stable density estimates, is made necessary by our 
particular projection index. 

In order to define it we must first introduce the concept of “false neighbouring”
observations along the direction $a_k$. Two units $i$ and $j$, coming from two
different populations, are said to belong to the set of “false neighbours” in the
$a_k$ direction, $FN(a_k)$, if their distance along $a_k$ is less than a threshold $FN_t$
that has to be specified by the researcher: that is,

$$i, j \in FN(a_k) \quad \text{if} \quad |a_k' x_i - a_k' x_j| < FN_t.$$

As the goal of discriminant analysis is to separate groups as much as possible,
those directions along which the number of “false neighbours” is minimized
will be preferred. This requirement alone is however not strong enough for
discriminant analysis purposes: when the number of false neighbours is equal
(Fig. 1), those directions along which the density at the false neighbours
is lower will be preferred.
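The definition of the set of false neighbours translates directly into a pairwise search over the projected data. The sketch below is illustrative only: the function names, the four two-dimensional points, their group labels and the threshold value are hypothetical, and the projected data are not standardized here.

```python
def project(direction, x):
    """Scalar projection a'x of the observation x along the given direction."""
    return sum(d * v for d, v in zip(direction, x))


def false_neighbours(points, groups, direction, threshold):
    """Pairs (i, j), i < j, from different groups with |a'x_i - a'x_j| < threshold."""
    proj = [project(direction, x) for x in points]
    pairs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if groups[i] != groups[j] and abs(proj[i] - proj[j]) < threshold:
                pairs.append((i, j))
    return pairs


points = [(0.0, 0.0), (0.2, 3.0), (3.0, 0.1), (3.1, 3.2)]
groups = [1, 2, 1, 2]

along_first = false_neighbours(points, groups, (1.0, 0.0), 0.5)
along_second = false_neighbours(points, groups, (0.0, 1.0), 0.5)
print(along_first)    # [(0, 1), (2, 3)]
print(along_second)   # []
```

In this toy configuration the second coordinate axis separates the two groups and produces no false neighbours, whereas the first axis mixes them, so the second direction would be preferred.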
Figure 1: Kernel density estimates of two homoscedastic mixtures having the
same number of “false neighbours”





This suggests computing the distance between “false neighbours” along $a_k$,

$$d_{ij}(a_k) = [\ldots\; p^k(a_k' x_j)\; \ldots] \quad \text{where } i, j \in FN(a_k), \tag{2}$$

and using an average distance as the projection index to be maximized:

$$I(a_k) = \operatorname*{average}_{i, j \in FN(a_k)} \left[ d_{ij}(a_k) \right]. \tag{3}$$



In order to make meaningful comparisons of different directions, the projected 
data should be standardized before computing the index. 

The projection index (3) closely resembles the measure suggested by Wong and 
Lane (1983) to solve a cluster analysis problem (see Calo (1997) for an 
application of ppde in this context). However, their measure, aimed at detecting 
Hartigan’s high density clusters (1977), was computed on the multivariate 
nearest-neighbour density estimate and defined for “any” neighbouring 
observations, the distinction between “true” and “false” neighbours being 
meaningless, in that context. 

The ppde-based discriminant procedure using (3) as the projection index
envisages two nested phases: a population separation phase and a population
boundary approximation one.

Within each step of the separation stage, the boundary approximation involves 
the search for those directions along which a given population is best separated 
from the remaining ones considered as a whole. The number of directions may 
be different for the different separation steps, and can be chosen by inspecting 
the projection index trend, stopping when it becomes stable or decreases. At the
generic k-th step the highlighted structure is incorporated in $\hat{p}$ through the
augmenting functions and cleaned away from the data by “gaussianizing” them
(Friedman, 1987).
The separation phase is stopped after a number of steps equal to the number of
the populations under study (however, when only two populations have to be
separated, one separation step is enough). This sequential procedure leads to $\hat{p}$,
that is the discrimination oriented density estimate of $p$, which will be used to
derive an allocation rule.

What has been said up to now shows that the projection index we propose 
involves the data only (and not a current model); this is the reason why, in this 
context, the ppde has to be performed following the analytic approach. 

When the results derived so far have to be used to devise an allocation rule the 
concept of “true neighbours” of a new unit z , that has to be classified, must be 
introduced. 

Supposing that $q_l$ directions have turned out to be necessary in order to
optimally approximate the boundary between population $\Pi_l$ and the remaining
ones, the generic unit $j$ of the training set is said to belong to the set of $z$’s true
neighbours (TN) if it lies within a $TN_{t_l}$ neighbourhood of $z$, for $l = 1, 2, \ldots, g$;
that is

[...]

where $x_0$ denotes the $m$-dimensional observation on $z$.

Coherently with Wong and Lane’s use of the single linkage method in cluster
analysis, the new unit $z$ is then allocated to the population to which the nearest
unit belonging to its TN belongs:

$$z \to \Pi_l \quad \text{if } \exists\, j \in \Pi_l : \; [\ldots\; \hat{p}(x_0) \;\ldots\; \hat{p}(x_j) \;\ldots] = \min_{y \in TN}\, [\ldots]$$

but different allocation rules could be devised, too.

The performance of the method is heavily conditioned by the choice of $FN_t$
and $TN_{t_l}$, $l = 1, 2, \ldots, g$. As $FN_t$ plays the role of a smoothing parameter,
analogous for instance to the window width of kernel density estimation, it can
be selected through the classical criteria suggested for the latter density
estimation method. On the contrary the definition of $TN_{t_l}$, $l = 1, 2, \ldots, g$, is a
little more tricky. Each element can be chosen, through cross validation, as the
value minimizing the classification error rate within the training set.
3. Concluding remarks

The proposed methodology (ppda), translated into a GAUSS algorithm, has
been applied to a simulated data set for the two group case and compared to the
classical linear and quadratic allocation rules and to two nonparametric
procedures based on kernel density estimation and on Polzehl’s solution.

The following situation has been considered: 
where $S = \mathrm{diag}(1, 1, 25, 25, 25)$, with sample size 60 for each population. The last
three variables represent noisy variables and have been purposely included in
order to show the ability of projection pursuit methods to overcome the “curse
of dimensionality”. The “false neighbourhood” threshold $FN_t$ has been
determined by the “rule of thumb” suggested for kernel density estimation and
the projection index has been computed as the arithmetic mean distance
between “false neighbours”. Along each direction, the data density has been
estimated by univariate normal kernel estimators with bandwidth chosen
according to the Sheather and Jones (1991) method. Within the unique separation
step, two directions have turned out to be necessary in order to approximate the
boundary between $\Pi_1$ and $\Pi_2$; further directions do not give a relevant
improvement in the projection index value. The probability of misclassification has
been estimated by classifying 1000 additional observations for each population.
The simulation results, reported in Table 1, are expressed in terms of mean
probability of misclassification (MPMC).



Table 1: Simulation results on 100 replications

Rule                               MPMC     std(MPMC)
linear rule                        0.24     0.0006
quadratic rule                     0.24     0.0012
nonparametric rule (kernel) (1)    0.12     0.0015
Polzehl’s ppda                     0.03     0.0011
ppda rule                          0.03     0.0017

(1) The gaussian kernel window width has been estimated by maximum likelihood cross validation.
The obtained results clearly highlight the good performance of the ppde-based
discrimination rules. Compared with the other parametric and nonparametric
methods considered, our solution is in line with Polzehl’s.

We have also compared the performance of our method and of Fisher’s
discriminant analysis in the case of two 4-variate homoscedastic normal
populations with identity covariance matrix and expectations given by
$\mu_{1j} = 1 = -\mu_{2j}$, $j = 1, \ldots, 4$, from which 10 samples of size $n_1 = n_2 = 100$ have
been generated. Table 2 shows the angles in degrees between the true optimal
direction and its estimates by Fisher’s linear discriminant rule and by our
method. Fisher’s results are only slightly more accurate than ours, but this is not
surprising, since Fisher’s discrimination takes into account that both
populations are multivariate homoscedastic normals while our approach is fully
nonparametric.
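The angle criterion used in this comparison can be computed as in the sketch below. With identity covariance and means $\mu_1 = (1, 1, 1, 1)'$ and $\mu_2 = -\mu_1$, the optimal linear direction is proportional to $\mu_1 - \mu_2$, that is to $(1, 1, 1, 1)'$. The estimated direction in the example is hypothetical, and treating a direction and its opposite as equivalent is an assumption of the sketch.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two directions, ignoring their orientation."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / (norm_u * norm_v))  # guard against rounding above 1
    return math.degrees(math.acos(cosine))


true_direction = [1.0, 1.0, 1.0, 1.0]   # proportional to mu_1 - mu_2
estimate = [1.0, 1.0, 1.0, 0.0]         # hypothetical estimated direction

print(angle_degrees(true_direction, estimate))
print(angle_degrees(true_direction, [-1.0, -1.0, -1.0, -1.0]))
```

The hypothetical estimate lies about 30 degrees away from the optimal direction, while the reversed optimal direction gives an angle of zero because it defines the same projection.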



Table 2: Angles (in degrees) between the true optimal direction and its
estimates by Fisher’s method and our ppde-based discrimination procedure.

Fisher’s method     Proposed ppda
     5.077             10.053
     8.289              6.695
     4.484              6.300
     8.633             15.403
     4.600              8.055
     3.580              4.846
     5.805              8.814
     7.408              9.491
     7.845             11.211
     5.187              8.783



The simulation results seem therefore to support the conclusion that the 
proposed classification method represents a valid solution for problems when 
the parametric approaches are not flexible enough and sample sizes are too 
small to use classical nonparametric methods.



Abstract: Methods for improving the predictive power of unstable classifiers
by combining multiple versions of them have received much attention in
the last few years. The aim of this paper is to compare some of the proposed
methods, with a focus on neural network classifiers. Experimental results
illustrate, on different data sets, the performance of different
methods of combining the outputs of several neural classifiers.

Keywords: supervised classifiers, combining, neural networks. 



1. Introduction 

The idea of combining predictors instead of selecting the single best has been
studied by the statistical community for a long time (see, for example, Granger,
1989). This idea implicitly assumed that one could not identify the "true"
model, but that different forecasting models were able to capture different
aspects of the available information, with gains in accuracy.

More recently, this concept has been extended to the combination of multiple
versions of the same predictor. In fact, it is well known that some classification
and regression methods "are unstable in the sense that small perturbations in
their training sets or in construction may result in large changes in the
constructed predictor" (Breiman, 1994). Improvement may occur by generating
multiple versions of the predictor, by perturbing the training set or the
construction method, and then pooling the available outputs into a single
predictor. Stacked generalization, introduced by Wolpert (1992), provides a
method that uses cross-validation data and least squares to determine the
coefficients of a linear combination of the different predictors. In the context of
regression and classification trees, Breiman (1994) suggests generating
different training sets by making bootstrap replicates of the original learning
set. Multiple classifiers are constructed using these different sets and then
combined by majority voting to obtain the so-called "bagging" predictor,
whereas for regression problems the final solution is obtained by averaging all
the available predictors.

When classification is performed using a neural network, multiple classifiers
can be derived from bootstrap samples. However, it is always necessary to train a
multitude of models for each sample, because different starting values for the
parameters may lead to different predictors, and may result in differences in
performance. A multitude of models is thus available using the same training
set, just starting from different points in the parameter space.

The main purpose of this investigation is to compare different ways of
generating and combining multiple neural network classifiers. Section 2
introduces neural network classifiers. Section 3 discusses the usefulness of
combining and describes different methods of training and combining neural
classifiers. Section 4 presents the results obtained on three data sets and
discusses the performance of the different solutions. Section 5 gives the
conclusions and directions for future research.



2. Neural network classifiers 



In the last few years, artificial neural networks have found an important role in 
classification. We will focus on the so called multilayer perceptron, which is the 
most popular type of network for supervised classification problems. For more 
general accounts of neural computation, see for example Hertz et al. (1991). 

The multilayer perceptron classifier performs a nonlinear transformation
$g(\cdot)$ of the feature vector $x = (x_1, \ldots, x_p)'$ into $K$ outputs
$o = (o_1, \ldots, o_K)'$ that define the class membership of the objects. More
specifically, in the so-called single hidden layer perceptron the output
function is defined as

$$o = g(x, \theta) = F[\alpha \Psi(\gamma' x)] \qquad (1)$$

where $\theta = (\alpha_1', \ldots, \alpha_K', \gamma_1', \ldots, \gamma_h')'$,
$\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_h)$ and
$\alpha = (\alpha_1, \ldots, \alpha_K)'$ are weight matrices of dimensions
$(p+1) \times h$ and $K \times (h+1)$ respectively, $h$ is the number of hidden
units, $\Psi$ is some given nonlinear mapping from $\mathbb{R}^h$ to
$\mathbb{R}^h$ and $F$ is a vector function from $\mathbb{R}^K$ to
$\mathbb{R}^K$. For classification problems, if the activation function of the
$k$-th output unit is defined as follows

$$o_k = g_k(x, \theta) = \frac{\exp(\alpha_k' \Psi(\gamma' x))}{\sum_{j=1}^{K} \exp(\alpha_j' \Psi(\gamma' x))} \qquad (2)$$

the output of the classifier can be interpreted as an estimate of the a posteriori
probability of the $k$-th class.
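As an illustration of equations (1) and (2), the following minimal sketch computes the outputs of a single hidden layer perceptron with softmax output units. It is not the authors' implementation: the function name `mlp_posteriors`, the use of tanh as the hidden mapping, the symbol `p` for the number of inputs and the handling of the bias terms as an extra leading input are all assumptions made for the example.

```python
import numpy as np

def mlp_posteriors(x, gamma, alpha):
    """Softmax outputs of a single hidden layer perceptron.

    x     : feature vector of length p (bias term not included)
    gamma : (p + 1, h) input-to-hidden weights, first row holds the biases
    alpha : (K, h + 1) hidden-to-output weights, first column holds the biases
    """
    x1 = np.concatenate(([1.0], x))        # prepend the constant input
    hidden = np.tanh(gamma.T @ x1)         # hidden mapping (tanh is an assumed choice)
    h1 = np.concatenate(([1.0], hidden))   # prepend the constant hidden unit
    scores = alpha @ h1                    # one linear score per class
    scores -= scores.max()                 # shift for numerical stability
    expo = np.exp(scores)
    return expo / expo.sum()               # normalised exponentials, as in (2)

rng = np.random.default_rng(0)
p, h, K = 4, 3, 3
gamma = rng.normal(size=(p + 1, h))
alpha = rng.normal(size=(K, h + 1))
post = mlp_posteriors(rng.normal(size=p), gamma, alpha)
print(post, post.sum())
```

The outputs are positive and sum to one, which is what allows them to be read as estimated class probabilities.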

The role of the learning or training process is to determine the best values for
the network weights $\theta$ on the basis of a set of $n$ observations, the
training sample. This is generally done by optimizing some appropriate objective
function over all observations. Different optimization procedures can be used
and, to avoid overfitting, the performance is better measured either by
cross-validation on the training data or on an independent validation set.



3. Combining neural networks

Within the past decade a considerable literature on the combination of
classifiers has accumulated. Besides the empirical evidence showing
that the classification error rate can be reduced by learning and using multiple
models, there are also relevant theoretical results (see, for instance, Perrone &
Cooper, 1993; Hashem et al., 1994; Breiman, 1994; Tibshirani, 1996).

The improvement in the performance of the ensemble classifier relates to the
bias-variance trade-off. Generalizing the decomposition of the prediction error
in the regression context, several authors define the concepts of bias and
variance of a classifier. Tibshirani (1996), for instance, gives a decomposition
of the prediction error of a classifier under general error measures and derives
bootstrap estimates of its components. Unstable classifiers, such as trees and
neural networks, characteristically have low bias but high variance. It has been
shown that the main effect of combining classifiers is to reduce the variance,
so that the resulting predictor can be more accurate than the original one. As
stated by Breiman (1996) for bagging classifiers, accuracy increases if the
prediction method is good but unstable. On the other hand, aggregation can
make stable but biased procedures worse.

Generating multiple classifiers via bootstrap replicates of the training set
requires considerable computational effort, particularly with computer-intensive
methods such as neural networks. In fact, empirical evidence shows that a
significant increase in the accuracy of a classifier cannot be obtained by
considering only a few replicates.

As an alternative to bagging neural classifiers, we consider a straightforward
and less time-consuming way to generate a multitude of models using the same
training set. Starting the training process of a neural classifier with different
weight initializations can lead to classifiers with the same architecture but
different performance. Instead of selecting the best one, we summarize the
information in this set of classifiers in a single aggregated predictor.

The most popular way of combining multiple classifiers is majority voting,
which can be applied to any type of classifier.

Consider multilayer perceptron classifiers with output function (2). Since each
object is assigned to the class whose output unit has the largest activation value,
some information may be discarded by considering only the winning class for
the multiple classifier. There can be significant differences in the activation
values depending on the classifiers used, even if they select the same class. For
this reason, other methods, such as the average of the corresponding output
values, have been considered as combining techniques. This method implicitly
assumes that all the networks are equally good, but we might expect that some
classifiers would make better predictions than others. As in the
regression context, we can think of making a weighted combination of the
results and thus estimating the a posteriori probability of the $k$-th class by

$$\hat{p}(k \mid x) = \sum_j a_j \, o_{kj},$$

where $o_{kj}$ is the $k$-th output of the $j$-th network. For
regression problems, the optimal set of weights can be determined by
minimizing a sum-of-squares error function (see, for example, Hashem et al.,
1994). In the classification context, this weight estimation does not guarantee
that the estimated probability will stay between 0 and 1. The solution to this
problem can be obtained by requiring that $a_j \geq 0 \;\forall j$ and
$\sum_j a_j = 1$.

Moreover, a different error function, such as the cross-entropy, is
more appropriate for this class of problems and could lead to a better solution.
The minimization of this error function under the two constraints above
is a more difficult problem that will be analyzed and developed in our
future work. A different solution could be reached by relating the weights to
the performance of the classifiers. We can assign to each classifier a weight
inversely related to the error rate of that classifier on an independent data set.
Two different sets of weights can be derived, depending on whether or not the
observations that are assigned to the same class by every classifier are
excluded from the calculation of the error rate.
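The three combining schemes discussed in this section (majority voting, simple averaging and weighted averaging) can be sketched as follows. This is only an illustration under stated assumptions: the function name `combine` is hypothetical, and taking each weight proportional to the reciprocal of the error rate is just one possible way of making the weights "inversely related" to the error; the text does not specify the exact form.

```python
import numpy as np

def combine(outputs, errors=None):
    """Combine J classifiers whose outputs are stored as an array (J, n, K).

    errors : optional error rates of the J classifiers on an independent set.
    Returns the labels from voting, averaging and weighted averaging,
    together with the weights used.
    """
    J, n, K = outputs.shape
    # Majority voting: each classifier votes for its winning class.
    votes = outputs.argmax(axis=2)                                  # (J, n)
    counts = np.stack([(votes == k).sum(axis=0) for k in range(K)], axis=1)
    vote_label = counts.argmax(axis=1)       # ties go to the lowest class index
    # Simple averaging of the output values.
    avg_label = outputs.mean(axis=0).argmax(axis=1)
    # Weighted averaging with non-negative weights summing to one.
    if errors is None:
        a = np.full(J, 1.0 / J)
    else:
        inv = 1.0 / (np.asarray(errors, dtype=float) + 1e-12)  # assumed form
        a = inv / inv.sum()
    w_label = np.tensordot(a, outputs, axes=1).argmax(axis=1)
    return vote_label, avg_label, w_label, a

# Two networks weakly prefer class 0, one strongly prefers class 1.
outputs = np.array([[[0.60, 0.40]], [[0.55, 0.45]], [[0.10, 0.90]]])
vote, avg, wavg, a = combine(outputs, errors=[0.30, 0.30, 0.10])
print(vote, avg, wavg, a)
```

In this toy case voting selects class 0 while both averages select class 1, which shows how the size of the activation values, discarded by voting, can change the decision.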



4. Some examples 

In this section we analyze the performance of different combining methods
on three different classification problems.

The first data set is derived from the synthetic waveform recognition problem
presented in Breiman et al. (1984). It is a three-class problem with
21-dimensional feature vectors. Three hundred vectors were generated using equal
prior probabilities. In order to compare the performance of different classifiers,
three hundred new observations were generated for each class as a test set.

The second example, the threenorm problem, is based on a two-class data set
with 10-dimensional measurement vectors. Class 1 is drawn with equal
probability from a unit multivariate normal with mean $(a, \ldots, a)$ and from
a unit multivariate normal with mean $(-a, \ldots, -a)$. Class 2 is drawn from a
unit multivariate normal with mean $(a, -a, \ldots, a, -a)$, where
$a = 2/\sqrt{20}$. Two hundred observations were generated as a training set and
one thousand observations as a test set.
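A minimal sketch of how threenorm data of this kind could be simulated is given below. The function name `threenorm` and the random seeds are hypothetical, and equal prior probabilities for the two classes are an assumption, since the text does not state the priors for this problem.

```python
import numpy as np

def threenorm(n, d=10, seed=None):
    """Simulate n observations of the two-class threenorm problem."""
    rng = np.random.default_rng(seed)
    a = 2.0 / np.sqrt(20.0)
    y = rng.integers(1, 3, size=n)             # labels 1 or 2 (equal priors assumed)
    sign = rng.choice([-1.0, 1.0], size=n)     # class 1: mean +a or -a, equal probability
    mean1 = sign[:, None] * np.full(d, a)      # (a,...,a) or (-a,...,-a)
    mean2 = a * np.where(np.arange(d) % 2 == 0, 1.0, -1.0)   # (a,-a,...,a,-a)
    means = np.where((y == 1)[:, None], mean1, mean2[None, :])
    X = means + rng.standard_normal((n, d))    # identity covariance
    return X, y

X_train, y_train = threenorm(200, seed=0)
X_test, y_test = threenorm(1000, seed=1)
print(X_train.shape, X_test.shape)
```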

The third example is a real classification problem, where the objective is to
correctly identify different glass types. It is a six-class problem. Each of the 214
observations consists of 9 chemical measurements on one of 6 types of glass.
The data set is in the UCI repository of machine learning databases (ftp
ics.uci.edu/pub/machine-learning-databases). The sample was randomly split
into a training set (107 units), a validation set (53 units) and a test set (54 units),
on which the performance of the classification rules was measured.

Multilayer perceptrons with different numbers of hidden units are used to
classify the patterns of the three data sets. All the networks are trained until the
classification error rate on a validation set reaches a minimum.

For each data set we calculate the test set error we would obtain on average if
we decided to perform only one run (singles column). This is used as a
reference against which the other methods are compared.

Table 1 reports in the singles column the average test set performance (standard
deviations in brackets) over 10 runs of a single multilayer perceptron (MLP),
each starting from a different random initial set of weights. The last three
columns provide the combining results for three, five and seven classifiers.
That is, we have considered all the possible combinations of three, five or
seven classifiers over the 10 available runs. We also determine, on average, the
test set error of the network with the lowest error on the validation set among
the classifiers combined (minimum column).
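The evaluation over all possible combinations of three, five or seven classifiers out of 10 runs can be sketched as follows. The function name `combo_error` is hypothetical and the network outputs are replaced here by random probability vectors, so the error values themselves are meaningless; only the enumeration scheme is illustrated, with averaging as the combiner.

```python
from itertools import combinations
import numpy as np

def combo_error(outputs, y, size):
    """Mean and std of the test error (%) over all subsets of `size` runs.

    outputs : array (runs, n, K) with the class outputs of each run
    y       : true labels coded 0..K-1
    """
    errs = []
    for subset in combinations(range(outputs.shape[0]), size):
        avg = outputs[list(subset)].mean(axis=0)      # average the outputs
        errs.append(100.0 * np.mean(avg.argmax(axis=1) != y))
    return float(np.mean(errs)), float(np.std(errs)), len(errs)

rng = np.random.default_rng(0)
outputs = rng.dirichlet(np.ones(3), size=(10, 50))    # placeholder outputs
y = rng.integers(0, 3, size=50)
for size in (3, 5, 7):
    print(size, combo_error(outputs, y, size))
```

With 10 runs there are 120 subsets of size three, 252 of size five and 120 of size seven, so each entry in the combining columns is an average over that many aggregated classifiers.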



Table 1: Test set error rate (%) for the waveform problem
(No. = number of classifiers combined; standard deviations in brackets)

Waveform          Singles       No.   Minimum       Averaging     W. Aver.      Voting
2 hidden units    18.6 (1.4)    3     18.5 (1.4)    16.3 (0.9)    16.2 (0.9)    16.9 (0.9)
                                5     18.6 (1.5)    15.7 (0.5)    15.7 (0.5)    16.2 (0.7)
                                7     18.4 (1.3)    15.4 (0.2)    15.4 (0.3)    15.8 (0.5)
3 hidden units    17.3 (1.3)    3     17.1 (0.8)    16.1 (0.6)    16.1 (0.6)    16.3 (0.8)
                                5     17.5 (0.5)    15.8 (0.5)    15.9 (0.4)    16.0 (0.5)
                                7     17.1 (0.3)    15.8 (0.4)    15.8 (0.4)    15.9 (0.3)
4 hidden units    17.3 (0.6)    3     16.9 (0.2)    16.9 (0.2)    16.7 (0.3)    17.2 (0.5)
                                5     17.2 (0.5)    16.8 (0.2)    16.8 (0.1)    17.2 (0.3)
                                7     16.9 (0.1)    16.9 (0.2)    16.7 (0.1)    17.1 (0.3)
5 hidden units    19.5 (0.8)    3     19.5 (0.2)    18.7 (0.4)    18.7 (0.4)    18.9 (0.4)
                                5     19.0 (0.7)    18.9 (0.5)    18.9 (0.5)    19.0 (0.6)
                                7     18.4 (0.7)    18.6 (0.3)    18.6 (0.3)    18.8 (0.5)



The results show that the performances of the simple and weighted averages are
similar. For the waveform problem, the simplest structure, with 2 hidden units,
is the one that achieves the greatest error reduction (-15.6%). One important
observation that emerges from Table 1 is that combining classifiers with the
same (erroneous or correct) classification decisions provides little gain,
regardless of the chosen scheme. This is particularly evident for the multilayer
perceptrons with 4 and 5 hidden units. The aggregate predictors perform better
than the minimum only for 2 and 3 hidden units. These architectures also give
the aggregate predictors with the lowest error rates.

The aggregate classifiers for the threenorm problem do not seem to outperform
the single ones significantly. To improve their performance we generated
different training sets, resampling the original one by both cross-validation and
bootstrap. The results in Table 3 show that, when only a few classifiers are
combined, cross-validation gives the best results in terms of error reduction.
If we compare these results with those of Table 2, we can observe that
cross-validation applied to networks with 2 hidden units gives an error rate
similar to the one obtained by averaging architectures with 5 hidden units. The
bootstrap's worse performance probably depends on the structure of each
resampled set, which contains only about 63% of the data on average. The
exclusion of a subset of observations may reduce the performance of the
individual classifiers, negating any potential gain. We have to combine a large
number of classifiers to obtain a substantial improvement, but from a
computational point of view this is not a winning strategy.
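The figure of about 63% follows from the fact that an observation is left out of a bootstrap sample of size $n$ with probability $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.368$. The short check below compares this value with a simulation; the sample size of 200 matches the threenorm training set, while the number of replicates and the seed are arbitrary choices.

```python
import numpy as np

n = 200                                      # size of the training set
expected = 1.0 - (1.0 - 1.0 / n) ** n        # probability of being included
rng = np.random.default_rng(0)
# Fraction of distinct observations in each of 2000 bootstrap samples.
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(2000)]
simulated = float(np.mean(fractions))
print(round(expected, 3), round(simulated, 3))
```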



Table 2: Test set error rate (%) for the threenorm problem
(No. = number of classifiers combined; standard deviations in brackets)

Threenorm         Singles       No.   Minimum       Averaging     Voting
2 hidden units    24.5 (3.0)    3     23.8 (0.4)    22.7 (0.5)    22.7 (0.4)
                                5     23.9 (0.6)    22.7 (0.5)    22.7 (0.5)
                                7     24.1 (0.5)    22.6 (0.4)    22.8 (0.3)
3 hidden units    23.3 (1.2)    3     23.0 (0.5)    22.5 (0.5)    22.4 (0.6)
                                5     22.9 (0.6)    22.4 (0.5)    22.3 (0.5)
                                7     23.1 (0.5)    22.2 (0.4)    22.4 (0.3)
4 hidden units    22.9 (0.9)    3     22.3 (0.9)    22.2 (0.6)    22.2 (0.7)
                                5     22.1 (0.8)    22.1 (0.4)    22.0 (0.5)
                                7     22.0 (0.7)    22.0 (0.3)    22.1 (0.4)
5 hidden units    22.9 (0.9)    3     22.3 (0.5)    21.9 (0.5)    22.2 (0.5)
                                5     22.2 (0.4)    21.8 (0.3)    22.0 (0.9)
                                7     22.3 (0.4)    21.9 (0.3)    21.9 (0.3)



Table 3: Test set error rate (%) for alternative methods of perturbing the data
(No. = number of classifiers combined; standard deviations in brackets)

Threenorm         Singles       No.   Average       Voting
2 hidden units
  Training set    24.5 (3.0)    3     22.7 (0.5)    22.7 (0.4)
                                5     22.7 (0.5)    22.7 (0.4)
                                7     22.6 (0.4)    22.8 (0.3)
  3-fold CV       23.7 (1.8)    3     22.8 (0.8)    22.2 (0.8)
  5-fold CV       23.3 (1.0)    5     21.8 (0.5)    22.5 (0.7)
  7-fold CV       23.1 (0.8)    7     21.7 (0.4)    22.3 (0.5)
  Bootstrap       24.2 (1.9)    3     23.1 (1.1)    23.5 (0.7)
                  24.2 (1.9)    5     22.8 (0.6)    23.3 (0.5)
                  24.2 (1.9)    7     22.8 (0.5)    23.4 (0.4)
                  24.4 (1.8)    30    22.4 (0.4)    22.3 (0.5)
3 hidden units
  Training set    23.3 (1.2)    3     22.5 (0.5)    22.4 (0.6)
                                5     22.4 (0.5)    22.4 (0.6)
                                7     22.2 (0.4)    22.4 (0.3)
  3-fold CV       23.1 (1.3)    3     21.8 (0.7)    21.9 (0.8)
  5-fold CV       23.3 (0.9)    5     22.0 (0.4)    22.7 (0.5)
  7-fold CV       23.4 (0.7)    7     22.1 (0.3)    22.5 (0.4)
  Bootstrap       24.2 (1.6)    3     22.7 (0.8)    23.0 (0.9)
                  24.2 (1.6)    5     22.3 (0.6)    22.7 (0.7)
                  24.2 (1.6)    7     22.2 (0.4)    22.5 (0.5)
                  24.2 (1.3)    30    21.9 (0.3)    22.2 (0.4)



The results for the glass data set (Table 4) confirm that the lowest error rate
can be obtained by combining classifiers trained on the same training set. The
simplest architectures (2 hidden units) give poor classifiers, whose performance
is not improved by aggregating. For this problem too, a good bootstrap
solution would probably require a large number of replicates.



Table 4: Test set error rate (%) for the glass problem
(No. = number of classifiers combined; standard deviations in brackets)

Glass             Singles       No.   Minimum       Average       W. Aver.      Voting
2 hidden units
  Training set    32.5 (1.9)    3     32.0 (1.3)    31.6 (1.9)    31.5 (1.8)    31.6 (1.4)
                                5     31.9 (0.8)    31.4 (1.7)    31.4 (1.6)    31.2 (1.2)
                                7     31.9 (0.5)    31.6 (1.6)    31.6 (1.6)    31.2 (1.2)
  Bootstrap       44.2 (6.3)    3                   41.2 (2.5)    41.3 (2.5)    42.1 (2.6)
                                5                   40.6 (1.7)    40.7 (1.7)    41.5 (1.6)
                                7                   40.0 (1.7)    40.2 (1.5)    41.1 (1.2)
3 hidden units
  Training set    32.1 (2.5)    3     32.3 (2.2)    28.2 (2.2)    28.1 (2.2)    28.9 (2.5)
                                5     33.2 (1.5)    27.7 (2.0)    27.5 (2.2)    27.5 (1.1)
                                7     33.7 (0.9)    27.8 (2.1)    27.2 (2.1)    27.0 (1.3)
  Bootstrap       42.6 (5.9)    3                   38.7 (4.1)    38.2 (4.1)    41.1 (4.3)
                                5                   36.8 (3.3)    38.7 (3.6)    36.7 (3.6)
                                7                   35.7 (2.6)    35.4 (2.8)    37.9 (3.0)
4 hidden units
  Training set    31.3 (3.4)    3     29.9 (2.6)    28.8 (1.8)    28.7 (1.7)    29.6 (2.3)
                                5     29.4 (1.8)    29.1 (1.7)    29.0 (1.8)    29.7 (1.9)
                                7     29.6 (1.0)    29.3 (1.3)    29.0 (1.7)    30.1 (1.7)
  Bootstrap       38.8 (6.3)    3                   34.0 (3.9)    33.9 (4.1)    36.0 (3.9)
                                5                   33.3 (2.7)    33.1 (2.9)    35.4 (2.9)
                                7                   32.9 (2.0)    32.6 (2.1)    34.4 (2.4)
5 hidden units
  Training set    31.5 (4.5)    3     29.2 (4.0)    29.5 (2.2)    29.5 (2.4)    29.8 (2.6)
                                5     29.0 (3.7)    29.3 (2.0)    29.3 (2.0)    28.7 (2.0)
                                7     28.9 (3.2)    29.4 (1.3)    29.5 (1.8)    28.4 (2.0)
  Bootstrap       48.4 (6.7)    3                   46.9 (4.3)    46.0 (4.9)    48.4 (3.2)
                                5                   46.3 (2.9)    45.8 (3.4)    47.1 (2.8)
                                7                   45.7 (2.7)    45.0 (2.7)    46.7 (2.2)



5. Concluding remarks 

Combining multiple versions of unstable classifiers by resampling the training
set is a variance reduction method, but a substantial improvement requires
considerable computational effort, particularly with computer-intensive methods
such as neural networks.

We have already stressed that for a neural network classifier one always needs
to estimate multiple versions of the model. Our experimental results seem to
confirm that combining the networks corresponding to different random initial
conditions can reduce the error rate while involving no additional
computational cost. The combined classifier often performs better than the best
single classifier.

Although in some cases the classification performance is not dramatically
better, all the combined results have a lower standard deviation, which means
that they are less unstable and less dependent on initial conditions.

One important observation that emerges from these experiments is that there
does not seem to be a particular combiner that can be labeled "best" under all
circumstances, although other methods of combining can be fruitfully explored
in the context of neural network classifiers. Nevertheless, the results, based on
straightforward methods, are quite encouraging because they provide a
relatively easy way to improve a neural network classifier.



References 

Breiman, L. (1994). Bagging predictors. Technical Report 421, Department of
Statistics, University of California, Berkeley.

Breiman, L. (1996). Bias, variance and arcing classifiers. Technical Report 460,
Department of Statistics, University of California, Berkeley.

Breiman, L., Friedman, J.H., Olshen, R.A. & Stone, C.J. (1984).
Classification and Regression Trees, Monterey, Wadsworth and
Brooks/Cole.

Granger, C.W.J. (1989). Combining forecasts - twenty years later. Journal of
Forecasting, 8, 167-173.

Hashem, S., Schmeiser, B. & Yih, Y. (1994). Optimal linear combinations of
neural networks: An overview, in: Proceedings of the 1994 IEEE
International Conference on Neural Networks, vol. 3, IEEE Press.

Hertz, J., Krogh, A. & Palmer, R. (1991). Introduction to the theory of neural
computation. Redwood City, Addison Wesley.

Perrone, M.P. & Cooper, L.N. (1993). When networks disagree: ensemble
methods for hybrid neural networks, in: Neural Networks for Speech and
Image Processing, Mammone, R.J. (Ed.), Chapman Hall.

Tibshirani, R. (1996). Bias, variance and prediction error for classification
rules. Technical Report, University of Toronto.

Wolpert, D.H. (1992). Stacked generalization. Neural Networks, 5, 241-259.




Selection of Cut Points 
in Generalized Additive Models 



Francesco Mola

Dipartimento di Matematica e Statistica
Università di Napoli Federico II
Complesso Monte S. Angelo
via Cinthia, I-80126 Napoli, Italy
e-mail: mola@dms.unina.it

Abstract: This paper proposes, in the framework of generalized additive models
(GAM), a cut point selection method for GAM smoothers that stems from
CART-like regression tree procedures. The proposal yields a
parsimonious bin smoother (regressogram) and a new smoother based on the
well-known loess smoother, and moreover provides the user with additional
information inherited from the regression tree methodology. The problem of
choosing the span parameter is also considered.

Keywords: Generalized Additive Models, Loess, Regression Trees,
Regressogram, Spline Smoothers.



1. Introduction 

Consider the vector of response measurements $y = (y_1, \ldots, y_n)'$ obtained
at design points $x = (x_1', \ldots, x_n')'$, where we assume that the $y_i$'s
represent measurements of some variable $Y$ and the $x_i$'s represent
measurements of a (vector) variable $X$. We assume, moreover, that $Y$ is a
response variable and $X$ is a (vector of) predictor(s).

In Section 2 the idea of GAM is concisely summarized and in Section 3
typical smoothers are reviewed. Section 4 does the same for the recursive
partitioning methods (of the CART type). Section 5 summarizes our new
proposal, introducing a new smoother based on the loess smoother proposed by
Cleveland in 1979.

Finally, in Section 6 an application to three data sets, both real and
simulated, is considered, together with an evaluation of
different smoothers.



2. Generalized additive models

The linear model holds a central place in the toolbox of applied statisticians.
Simple in structure, elegant in its least squares theory and easily interpretable
by its users, it is an invaluable tool. However, with the recent explosion in
the speed and memory size of computers, it has become possible to complement the
linear model with many new methods that rely on fewer assumptions and therefore
can potentially reveal more aspects of the data. Among these methods are
the so-called generalized additive models, which form a generalization
of the linear regression model. The central idea is to replace the usual linear
function of each covariate with an unspecified smooth function. The
estimated model consists of a function for each of the covariates, and the
additive model consists of a sum of such functions. This model is nonparametric
in that we do not impose a parametric form on the functions but instead estimate
them in an iterative manner through the use of the so-called scatterplot
smoothers. This is useful as a predictive model and can also help the data
analyst to discover the appropriate shape for each of the covariate effects.
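The iterative estimation mentioned above is usually carried out by backfitting (see Hastie and Tibshirani, 1990). The sketch below illustrates the idea under simplifying assumptions: the function names `bin_smooth` and `backfit` are hypothetical, the scatterplot smoother is a crude bin smoother with quantile-based bins, and a fixed number of iterations replaces a convergence check.

```python
import numpy as np

def bin_smooth(x, r, n_bins=5):
    """Crude scatterplot smoother: average r within quantile bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    region = np.searchsorted(edges, x, side="right")
    fit = np.zeros_like(r, dtype=float)
    for k in np.unique(region):
        fit[region == k] = r[region == k].mean()
    return fit

def backfit(X, y, n_iter=20):
    """Fit y = alpha + sum_j f_j(x_j) by cycling over the covariates."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # Partial residuals: remove the intercept and all other functions.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = bin_smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()          # centre each function
    return alpha, f

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = np.sin(2 * np.pi * X[:, 0]) + 4 * (X[:, 1] - 0.5) ** 2 + 0.1 * rng.normal(size=300)
alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
print(np.var(y), np.var(y - fitted))
```

Any other scatterplot smoother could be substituted for `bin_smooth` without changing the loop.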

Smoothers play a central role in the GAM idea. We can therefore
focus our attention on smoothers, since the choice of the smooth function
becomes fundamental.



3. Smoothers 

A scatterplot smoother used in GAM is typically defined as a smooth function of
X and y estimated in a nonparametric way. The basic idea is to let the data
themselves show the appropriate functional form. More precisely, these
methods try to expose the functional dependence between Y and X without
imposing rigid parametric assumptions.

Generalized additive models can also be applied to data
other than those usually described by the standard linear regression model.
Typical examples are binomial or binary response data, survival data and data
from case/control studies. Many estimators of this type have been proposed in
the literature, see Hastie and Tibshirani (1990) for details; we shall
concentrate on some of them, i.e., on the bin smoother (also known as the
regressogram), the locally weighted running-line smoother (currently called
loess in S+), kernel smoothers and spline smoothers.

It is well known that the regressogram mimics a categorical smoother by
partitioning the predictor values into a number of disjoint and exhaustive
regions and averaging the response in each region. Formally, we must fix cut
points

$$-\infty = c_0 < c_1 < \cdots < c_K = +\infty$$

and define the indices of the data points in each region by

$$R_k = \{ i : c_k \leq x_i < c_{k+1} \}, \qquad k = 0, \ldots, K-1.$$

Then the desired smoother $s(\cdot)$ is defined as

$$s(x_0) = \mathrm{ave}_{i \in R_k} \, y_i \qquad \text{if } x_0 \in [c_k, c_{k+1}).$$

This estimator is, however, quite rough because it has jumps at each cut point,
even if it is interesting from a theoretical point of view.
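A minimal sketch of the regressogram for a given set of interior cut points is shown below. The function name `regressogram` and the toy data are hypothetical; observations equal to a cut point are assigned to the region on its right.

```python
import numpy as np

def regressogram(x, y, cuts):
    """Return the bin smoother s(.) defined by the interior cut points `cuts`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cuts = np.sort(np.asarray(cuts, dtype=float))
    # Region index = number of cut points not exceeding x.
    region = np.searchsorted(cuts, x, side="right")
    means = np.array([y[region == k].mean() if np.any(region == k) else np.nan
                      for k in range(len(cuts) + 1)])

    def s(x0):
        return means[np.searchsorted(cuts, x0, side="right")]

    return s

x = np.array([0.1, 0.2, 0.6, 0.7, 1.2, 1.4])
y = np.array([1.0, 3.0, 10.0, 12.0, 5.0, 7.0])
s = regressogram(x, y, cuts=[0.5, 1.0])
print(s(0.3), s(0.5), s(2.0))
```

The fitted function is constant within each region and jumps at the cut points, which is the roughness noted in the text.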

Regression spline smoothers offer a compromise by representing the fit
as a piecewise polynomial. Although several configurations of splines can be
used, cubic polynomials are the usual choice. The regions
defining the pieces are separated by a sequence of knots (breakpoints). In
addition, it is customary to force the curve to be smooth at the knots. A
serious practical problem is the choice of the knots.

The locally weighted running-line smoother (loess or lowess, Cleveland,
1979) is based on the formula:

$$s(x_0) = \hat{\alpha}(x_0) + \hat{\beta}(x_0) \, x_0$$

where $\hat{\alpha}(x_0)$ and $\hat{\beta}(x_0)$ are the estimates at the point
$x_0$ obtained using a neighbourhood of $x_0$. It is usually easy and natural to
think of least squares estimates, but it is obviously possible to define
alternative methods.
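The local fit behind this formula can be sketched as a weighted least squares line computed on the nearest neighbours of the target point. This is a simplified illustration, not Cleveland's full procedure: the function name `local_linear` is hypothetical, tricube weights are assumed, and the robustness iterations of lowess are omitted.

```python
import numpy as np

def local_linear(x, y, x0, span=0.5):
    """Locally weighted line at x0 using a fraction `span` of the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(3, int(np.ceil(span * len(x))))     # size of the neighbourhood
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                  # k nearest neighbours of x0
    h = dist[idx].max()
    u = dist[idx] / h if h > 0 else np.zeros(len(idx))
    w = (1.0 - u ** 3) ** 3                     # tricube weights (assumed choice)
    sw = np.sqrt(w)
    design = np.column_stack([np.ones(len(idx)), x[idx]])
    # Weighted least squares estimates of the local intercept and slope.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], y[idx] * sw, rcond=None)
    return coef[0] + coef[1] * x0

x = np.linspace(0.0, 1.0, 21)
fit = local_linear(x, 2.0 * x + 1.0, 0.37, span=0.3)
print(fit)
```

On exactly linear data the local line reproduces the true value whatever the span; on real data the span controls how much of the sample enters each local fit.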

A crucial point in the use of loess, and of all smoothers of this type, is the
choice of the neighbourhood. It is important to specify not only whether the
neighbourhoods are symmetric (as in the running-mean smoother) or not, but also
how big the neighbourhood has to be. In the literature, this is known as the
problem of choosing the span parameter, i.e., the fraction of the data to be
considered in each neighbourhood.

It is known that as the span increases the fitted function becomes smoother,
but the estimates become less accurate. Conversely, as the span decreases,
the estimate follows the data more closely but the fitted function is less
smooth. A lot of attention has been paid to this problem; however, it seems that
a final solution has not been reached.

Kernel smoothers are based on local estimates at the points $x_0$, using
a parameter $h$ to define the amount of data over which the estimate is
computed and a kernel function that weights the data in the
neighbourhood of $x_0$.

A good source of information about all these procedures is the
monograph of Hastie and Tibshirani (1990) and the literature referenced there.



4. Recursive partitioning methods

Tree-based regression methods have recently played an increasingly important
role in regression analysis. Let us mention, among others, AID, introduced
in the literature by Morgan and Sonquist (1963), CART, described in Breiman et
al. (1984), FIRM, suggested by Hawkins (1990), the two-stage procedure of Mola
and Siciliano (1992) and RECPAM, described in a series of papers by Ciampi.

Tree-structured methods that use a recursive partitioning algorithm 
provide a powerful analysis tool for both exploration of the structure of the data 
and for the prediction of the outcomes of new cases. With tree-structured 
regression techniques, some of the restrictive classical assumptions about the 
relation between the response variable and the independent variables can be 
avoided. Moreover, a tree-structure provides easier interpretation than a 
regression equation or GAM since the regression tree identifies effects of 
covariates in subgroups whereas regression examines effects in the whole 
sample. 

From the nonparametric point of view, the result of the CART-like
regression tree approach is nothing other than a bin smoother (regressogram);
however, the way in which the cut points are found is totally different from the
classical regressogram estimator typically used by GAM smoothers, cf. Breiman et
al. (1984) or Hastie and Tibshirani (1990).
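To illustrate how a CART-like regression tree produces cut points for a single predictor, the sketch below recursively chooses the split that minimises the within-node sum of squares. It is a simplified stand-in for a regression tree procedure: the function names are hypothetical, and the stopping rule (a minimum node size and a minimum gain) replaces the pruning step of CART.

```python
import numpy as np

def best_split(x, y, min_size):
    """Cut point minimising the within-node sum of squares, and its gain."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    total = ((ys - ys.mean()) ** 2).sum()
    best_cut, best_sse = None, total
    for i in range(min_size, len(xs) - min_size + 1):
        if xs[i] == xs[i - 1]:
            continue                             # no cut between tied values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_cut, best_sse = 0.5 * (xs[i - 1] + xs[i]), sse
    return best_cut, total - best_sse

def tree_cut_points(x, y, min_size=5, min_gain=1e-8):
    """Sorted list of cut points found by recursive binary splitting."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cut, gain = best_split(x, y, min_size)
    if cut is None or gain <= min_gain:
        return []
    left = x < cut
    return (tree_cut_points(x[left], y[left], min_size, min_gain)
            + [float(cut)]
            + tree_cut_points(x[~left], y[~left], min_size, min_gain))

x = np.linspace(0.0, 1.0, 40)
y = (x >= 0.5).astype(float)                     # one jump at 0.5
cuts = tree_cut_points(x, y)
print(cuts)
```

Unlike equally spaced or quantile-based bins, the cut points are placed where the response changes level.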



5. The proposed methodology 

One of the main problems of the above-mentioned smoothers is the choice of
the cut points for the regressogram, of the placement of the knots for the
smoothing splines, and of the span parameter for loess and, in a different
manner, for the kernel smoother. Several ideas are discussed, e.g., in Hastie
and Tibshirani (1990); however, the possibilities mentioned there are not
exhaustive. Moreover, almost all of the standard GAM smoothers are based on the
classical nonparametric approach, which seeks a balance between reducing the
variance and reducing the bias.

Therefore, our idea is to replace, in the construction of the GAM
smoothers, the standard cut points resulting from the respective procedures with
the cut points resulting from the corresponding (CART-like) regression tree. In
this way the results of the regression tree procedure allow us to determine the
cut points for regressograms, the knots for spline smoothers and the
neighbourhoods for loess and kernel smoothers.
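The use of tree-derived cut points to define the neighbourhoods of a local smoother can be sketched as follows. This is only an illustration of the general idea, not the smoother proposed in the paper: the function name `piecewise_linear_smoother` is hypothetical, the cut points are assumed to be already available, and an ordinary least squares line in each region stands in for the loess fit.

```python
import numpy as np

def piecewise_linear_smoother(x, y, cuts):
    """Fit one least squares line within each region defined by `cuts`."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cuts = np.sort(np.asarray(cuts, dtype=float))
    region = np.searchsorted(cuts, x, side="right")
    coefs = {}
    for k in np.unique(region):
        xk, yk = x[region == k], y[region == k]
        if len(xk) >= 2 and xk.max() - xk.min() > 0:
            slope, intercept = np.polyfit(xk, yk, 1)
        else:
            slope, intercept = 0.0, yk.mean()    # fall back to the region mean
        coefs[int(k)] = (intercept, slope)

    def s(x0):
        # Regions without training data are not handled in this sketch.
        k = int(np.searchsorted(cuts, x0, side="right"))
        a, b = coefs[k]
        return a + b * x0

    return s

x = np.linspace(0.0, 2.0, 41)
y = np.where(x < 1.0, 2.0 * x, 5.0 - x)          # change of regime at x = 1
s = piecewise_linear_smoother(x, y, cuts=[1.0])
print(s(0.5), s(1.5))
```

Each local fit uses only the observations in its own region, so a jump or change of slope located by the tree is not smoothed away.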

Figures 1 and 2 show a schema of how our methodology works.
The first step is to perform the regression tree analysis on the observed data; in
this way we obtain the cut points (in the figure c1, c2, c3, c4, c5 respectively). The
cut points allow us to identify the five terminal nodes, denoted in Figure 1 by
$\alpha$, $\beta$, $\gamma$, $\delta$, $\varepsilon$. Figure 2 shows the
scatterplot of the data set and, in addition, marks the regions identified by
the regression tree analysis in which the loess procedure is to be performed.
It can be noticed that the regions change as the shape of the cloud of points
changes.

Figure 1 : Regression tree 





What advantages and disadvantages does this kind of methodology bring?

On the one hand, the solution is complemented by the information resulting
from the corresponding regression tree, thus giving us much more information
about the problem than the classical GAM solution. Taking into account the
basic idea that the data should “speak” for themselves, this is often an
extremely important advantage of our proposal.

With this kind of approach, we do not impose an a priori choice of the
cut points; we let the data structure determine this choice. In fact, the
regression tree has the advantage of detecting the points at which there are
jumps.

Another advantage is that our proposal is much more adaptable to the
data; of special practical interest is its ability to cope with the
dependence between Y and X. In this respect it is definitely superior to the
classical GAM approach.

On the other hand, both the regressogram and the classical regression tree
(of the CART type) are not very smooth because they have jumps at each of the
cut points. This disadvantage can to some extent be overcome by the use of
regression splines or loess smoothers.

Finally, notice that a fast computer is a must when we want to work
with this type of estimator. The methodology is, however, faster than
Cleveland's loess, since the estimation step uses only the part of the data
that is required.



6. Application 

As an example of application, we have considered three data sets. The
first one is distributed with the S-plus package (see Venables and Ripley,
1994, for details) and is due to Silverman (1985); it consists of 133
observations of acceleration against time for a simulated motorcycle accident.
The second and third data sets have been generated by us: the former consists
of 200 pairs with four jumps imposed, identifying four regions in which the
(x, y) pairs are very regular; the latter also consists of 200 pairs with four
jumps imposed but, unlike the previous one, with a lot of noise in the four
regions. In particular, we compare the accuracy of a smoother obtained by
applying the proposed methodology (i.e., to the loess) with standard
smoothers, such as the original loess smoother and the kernel smoother. In
order to evaluate the performance of these estimators the following commonly
used measure has been considered: $\frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$.
For the sake of brevity the figures of the scatterplot smoothers have been
omitted, but tables 1 to 3 (for the three data sets respectively) summarize
the results of the comparison. Notice that the proposed smoother can be
directly compared with the loess smoother since the span parameter is the
same, whereas the kernel smoother, which depends on the h parameter, cannot be
directly compared. Nevertheless, a global idea of the performance of this
last estimator can be deduced, too.

Comparing the results in the three tables, for the motorcycle accident
data set, simulated data set 1 and simulated data set 2 (with no jumps, with
jumps and regularities, with jumps and irregularities respectively), we can
notice that the proposed smoother always works better than the others, and
that its advantage is particularly marked in the presence of jumps, or of
jumps and irregularities. It is also possible to notice that the variability
of the proposed smoother across the span values is much lower than that of
the other smoothers, denoting the low influence of the choice of the span
parameter and of the neighbourhoods.



Table 1: Motorcycle data set

Span (%)   Proposed smoother   Loess smoother   Kernel smoother   “h” parameter
   10            14.59              14.94            15.86               1
   20            14.84              15.91            18.94               2
   30            15.20              16.99            26.81               5
   50            15.60              22.51            34.14              10
   66            15.94              26.86            37.53              20
   99            16.06              33.33            38.52              40



Table 2: Simulated data set 1 (jumps and regularities)

Span (%)   Proposed smoother   Loess smoother   Kernel smoother   “h” parameter
   10             0.15               0.41             0.19               1
   20             0.21               0.62             0.28               2
   30             0.21               0.81             0.45               5
   50             0.23               1.33             0.69              10
   66             0.24               1.57             1.24              20
   99             0.24               2.01             2.09              40



Table 3: Simulated data set 2 (jumps and irregularities)

Span (%)   Proposed smoother   Loess smoother   Kernel smoother   “h” parameter
   10             0.79              10.81             2.88               1
   20             1.16              22.73             5.53               2
   30             1.38              33.25            13.82               5
   50             2.96              54.91            27.50              10
   66             4.98              63.31            49.79              20
   99             8.52              81.44            82.47              40



Abstract: This paper provides a methodology to grow classification trees when
a multiple qualitative response is considered as a criterion variable. The
latent budget model is used recursively to find ever finer partitions of cases
into a number of groups fixed a priori. The Akaike statistic is used to select
the most predictive model at each node of the tree. A fruitful interpretation
of the final decision rule is given through the Bayes rule. The proposed
approach is also convenient for dealing with multiple questions through the
use of compound variables in the latent budget model. An application of the
proposed approach to a data set taken from a Survey of the Bank of Italy is
finally shown.

Keywords: Multiple Response, Latent Variable, Mixing Parameter, Bayes 
Rule, Akaike Criterion 



1. Introduction 

In the field of categorical data analysis the latent budget model is a
reduced-rank probability model used to decompose a table with constant row-sum
data (i.e., compositional data, time budgets, conditional frequency
distributions). The model is a mixture of conditional probabilities known as
latent budgets, and the mixture is defined by the mixing parameters (de Leeuw,
van der Heijden and Verboon, 1990). Latent budget models can be fruitfully
used for describing the dependence of a response variable on a given predictor
in two-way contingency tables (section 2). In the latent budget model the
response as well as the predictor can also be formed by several variables
through the definition of suitable compound variables. Particular interactions
among the variables can be tested by considering restrictions upon the
parameters (van der Heijden, Mooijaart and de Leeuw, 1992). An extension of
the latent budget model to the analysis of a set of two-way tables is known as
simultaneous latent budget analysis (Siciliano and van der Heijden, 1994;
Tambrea and Siciliano, 1997). However, as soon as the number of predictors
increases, the application of the latent budget model, and in general of any
parametric model, becomes infeasible because of the large number of parameters
to be introduced in the specification of the model.







A nonparametric approach is proposed in this paper in order to define a model
in the form of a classification tree for the prediction of a multiple response
variable (section 3). The tree-growing procedure is described in detail, and
an alternative interpretation of the tree through the Bayes rule is also
provided. An application performed on a data set taken from a Survey of the
Bank of Italy on Family Budgets is shown (section 4). The main advantages of
the proposed approach are outlined in the concluding remarks.



2. The latent budget model 

For a two-way cross-classification of N observed cases according to the I
categories of the predictor X and the J categories of the response variable Y,
let $p_{ij}$ be the observed proportion in cell (i, j), for i = 1,...,I;
j = 1,...,J, with $\sum_i \sum_j p_{ij} = 1$. The usual dot notation is used
for summations, i.e., $\sum_i p_{ij} = p_{.j}$. The interest pertains to the I
conditional distributions or observed budgets $p_{j|i} = p_{ij}/p_{i.}$ and
their departures from the marginal distribution $p_{.j}$. The latent budget
model decomposes the theoretical budgets $\pi_{j|i}$ as a mixture of K latent
budgets:

$$\pi_{j|i} = \sum_{k=1}^{K} \pi_{k|i}\, \pi_{j|k},$$

where the conditional probabilities $\pi_{j|k}$ for j = 1,...,J form the k-th
latent budget and the conditional probabilities $\pi_{k|i}$ for k = 1,...,K
are the mixing parameters for the i-th predictor category. For K = 1 the model
reduces to the independence model, i.e., $\pi_{j|i} = \pi_j$. In general,
$K \le K^*$ holds, where $K^* = \min(I, J)$ is the maximum rank of the matrix
of theoretical budgets (i.e., the saturated model). The latent budget model
can be written in matrix formulation as $P = AB'$, where the matrix A includes
the parameters $\pi_{k|i}$ whereas the matrix B includes the parameters
$\pi_{j|k}$; all the matrices are of proper order. The latent budget model is
not identified: $P = AB' = ASS^{-1}B'$ for any K x K matrix S whose rows sum
to one. Some restrictions are imposed upon the parameters to make them
identifiable; the number of restrictions depends on the number of free
parameters of the matrix S, this number being K(K-1). The degrees of freedom
are given by

$$I(J-1) - [I(K-1) + K(J-1) - K(K-1)] = (I-K)(J-K).$$

The estimate of the model can be obtained by maximum likelihood under the
product-multinomial sampling scheme as well as by an alternating least-squares
method. The rank K is selected to have a good fit of the model to the observed
data as tested by the usual likelihood ratio statistic $G^2$ or alternative
goodness of fit measures.
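
As a small illustration of the mixture and of the degrees-of-freedom formula,
the following Python sketch builds the theoretical budgets from given
parameter matrices. The function names and the numerical values in the usage
lines are hypothetical and serve only to make the formulas concrete; no
estimation is performed here.

```python
def theoretical_budgets(A, Bt):
    """Return the I x J matrix with entries sum_k pi_{k|i} * pi_{j|k}.

    A  : I x K mixing parameters (each row sums to one).
    Bt : K x J latent budgets, i.e. the transpose B' (each row sums to one).
    """
    K = len(Bt)
    J = len(Bt[0])
    return [[sum(a_row[k] * Bt[k][j] for k in range(K)) for j in range(J)]
            for a_row in A]


def degrees_of_freedom(I, J, K):
    """I(J-1) - [I(K-1) + K(J-1) - K(K-1)], which equals (I-K)(J-K)."""
    return I * (J - 1) - (I * (K - 1) + K * (J - 1) - K * (K - 1))


# Hypothetical parameter values, for illustration only.
A = [[1.0, 0.0], [0.75, 0.25], [0.5, 0.5]]
Bt = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]
P = theoretical_budgets(A, Bt)
```

Each row of `P` is again a probability distribution over the J response
categories, since it is a convex combination of the K latent budgets.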




3. The latent budget tree procedure 

3.1 Two strategies of analysis 

The aim is to define an exploratory tree as well as a tree predictor for the
response variable Y given a set of predictors observed on N cases. Both the
response variable and some of the predictors can be defined as compound
variables. The idea is to select at each node of the tree a predictor, and
thus a latent budget model, which is used to find a partition of the cases
into K groups, where K is the number of latent classes. As concerns the choice
of K, two strategies are proposed. In the first strategy, this number is fixed
a priori at the beginning of the procedure to be either K=2 or K=3, in order
to grow binary or ternary trees respectively. In the second strategy, the
number K may change from node to node and is given by the lowest number
corresponding to the most parsimonious model that fits the data. Such a
recursive procedure grows a classification tree, called a latent budget tree,
characterized by a sequence of latent budget models assigned to the nodes of
the tree.

3.2 The selection criterion 

The procedure selects at each node a predictor, and thus a model, considering
the set of tables which cross-classify the response variable with each
predictor in turn. The Akaike Information Criterion (AIC) is used as the
selection criterion in order to compare the fit of models with different
numbers of degrees of freedom. This leads to choosing the predictor X with the
associated model by maximizing the AIC criterion or, equivalently, by
minimizing the corrected AIC criterion $AIC^*_X = G^2(X) - 2\,df(X)$, where
$G^2(X)$ is the likelihood ratio statistic for testing the model X against the
saturated model. Obviously, in order to select a model with a fixed number of
latent classes (first strategy) it is necessary that both $J \ge K$ and
$I \ge K$ are satisfied; in particular, for K=3 it is necessary that J>2
(multiclass problem) and I>2; dummy predictors can be combined to form
compound variables (multiple questions). Instead, according to the second
strategy, the model with the lowest K that fits the data is selected by
applying a selection procedure starting with K=2.
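
The selection step can be sketched as follows. This is an illustrative sketch
only: the function names are assumptions, the fitted frequencies are supposed
to come from a latent budget model estimated elsewhere, and the candidate
values in the usage lines are hypothetical.

```python
import math


def likelihood_ratio(observed, fitted):
    """G^2 = 2 * sum n_ij * log(n_ij / m_ij) over cells with n_ij > 0.

    observed, fitted: tables (lists of rows) of observed and fitted
    frequencies; fitted values are assumed to be strictly positive.
    """
    total = 0.0
    for obs_row, fit_row in zip(observed, fitted):
        for n, m in zip(obs_row, fit_row):
            if n > 0:
                total += n * math.log(n / m)
    return 2.0 * total


def aic_star(g2, df):
    """Corrected criterion AIC* = G^2 - 2 df (smaller is better)."""
    return g2 - 2 * df


def select_predictor(candidates):
    """candidates maps a predictor name to its (G^2, df) pair;
    the predictor with the smallest AIC* is returned."""
    return min(candidates, key=lambda name: aic_star(*candidates[name]))


# Hypothetical (G^2, df) pairs for two candidate predictors at one node.
best = select_predictor({"predictor_a": (1.7, 4), "predictor_b": (8.1, 6)})
```

With these illustrative values `predictor_a` is retained, since
1.7 - 2*4 = -6.3 is smaller than 8.1 - 2*6 = -3.9.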

3.3 The partitioning criterion 

An important advantage of the latent budget model over other models for
categorical data is that the parameter estimates are conditional probabilities
adding up to one, so that the interpretation simplifies. This property is used
to define a partitioning criterion to apply at each node of the latent budget
tree. The sample of $N_t$ cases at node t can be partitioned into K sub-groups
on the basis of the estimates of the mixing parameters of the selected model.
For the I x K matrix A, whose rows sum to one (i.e., $\sum_k \pi_{k|i} = 1$),
the I predictor categories are summarized into K latent budgets; the i-th
predictor category is assigned to the k-th latent budget which presents the
highest mixing parameter estimate. The partition of the I categories into K
sub-groups induces the partition of the $N_t$ cases at node t into K
sub-groups. For K=2 the I categories are divided into two disjoint subgroups,
which induce the partition of the cases into the left sub-group of cases
presenting the categories with $\hat{\pi}_{1|i} \ge 0.5$ and the right
sub-group with the remaining cases.
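
The partitioning criterion amounts to an arg-max over each row of the matrix
of mixing parameter estimates. A minimal sketch, with hypothetical function
names and illustrative estimates (four predictor categories, K=2), is:

```python
def partition_categories(A):
    """Assign each predictor category i (row of A) to the latent budget k
    with the largest mixing parameter estimate; ties go to the lowest k,
    so that for K=2 a value of exactly 0.5 is sent to the left sub-group."""
    groups = {}
    for i, row in enumerate(A):
        k = max(range(len(row)), key=lambda idx: row[idx])
        groups.setdefault(k, []).append(i)
    return groups


def partition_cases(case_categories, groups):
    """Induce the partition of the cases from the partition of the
    categories. case_categories[c] is the predictor category of case c."""
    category_to_group = {i: k for k, cats in groups.items() for i in cats}
    return [category_to_group[cat] for cat in case_categories]


# Illustrative mixing parameter estimates (rows sum to one).
A_hat = [[1.00, 0.00], [0.75, 0.25], [0.82, 0.18], [0.00, 1.00]]
groups = partition_categories(A_hat)
node_of_case = partition_cases([0, 3, 2, 1, 3], groups)
```

Here the first three categories go to the left sub-group and the fourth to the
right one, and every case simply follows its predictor category.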

3.4 The stopping rule 

The partitioning procedure continues until the current node cannot be further
partitioned, either because the number of degrees of freedom of the current
best model is equal to zero or because the number of cases falls below a fixed
threshold. A node which is not further partitioned is declared a terminal node
and is assigned the class label corresponding to the highest proportion of
observed cases falling into the node. Such an exploratory tree describes the
conditional interactions of the predictors with the response variable.
Furthermore, a post-pruning procedure can be adopted to define a tree
predictor to be used for the classification of new cases of unknown class on
the basis of the given predictors (Cappelli and Siciliano, 1997).

3.5 The interpretation through the Bayes rule

A further aid to the interpretation of the tree is provided by the latent
budget parameter estimates. For the matrix B', whose rows sum to one (i.e.,
$\sum_j \pi_{j|k} = 1$), the K latent budgets are related to the J response
categories. Each row of the matrix B' can be compared with the independence
hypothesis given by the marginal proportions $p_{.j}$ of the current table
(prior probability estimates); for the k-th latent budget the response
categories which depart most from independence are those better predicted by
the latent budget. Furthermore, the latent budget parameter $\pi_{j|k}$ for
each j can be viewed as the posterior probability of falling in class j once
the case is assigned to the subgroup k. Using the Bayes rule we can write

$$\pi_{j|k} = \frac{\pi_j\, \pi_{k|j}}{\sum_j \pi_j\, \pi_{k|j}} = \frac{\pi_j\, \pi_{k|j}}{\pi_k},$$

which is the posterior probability that a case belongs to class j given that
it falls into the descendant node k. Starting from the root node of the tree,
where the prior probability estimates are given by the group proportions of
the criterion variable, the posterior probability estimates are recursively
updated and related through a chain of conditional probability estimates until
the posterior classification is defined in the terminal nodes.
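
The updating step of the Bayes rule can be written in a few lines. The sketch
below is only illustrative: the function name is an assumption and the
conditional probabilities $\pi_{k|j}$ used in the example are hypothetical.

```python
def posterior(prior, likelihood):
    """Bayes rule for one descendant node k.

    prior[j]      : pi_j, prior probability of response class j.
    likelihood[j] : pi_{k|j}, probability of falling in node k given class j.
    Returns the list of posterior probabilities pi_{j|k}.
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    pi_k = sum(joint)  # marginal probability of node k
    return [v / pi_k for v in joint]


# Hypothetical values: four response classes, one descendant node.
prior = [0.30, 0.24, 0.26, 0.20]
post = posterior(prior, [0.8, 0.8, 0.6, 0.4])
```

Moving down the tree, the posterior obtained at a node plays the role of the
prior for its descendants, which gives the chain of conditional probability
estimates mentioned above.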



4. Application

As an example of application of the latent budget tree methodology, a data set
taken from a Survey of the Bank of Italy on Family Budgets for the year 1994
is considered. It consists of 850 families, and the interest pertains to the
kind of payment preferred by the head of the family. In particular, four
typologies of users are identified: 1. archaic user (who pays exclusively by
cash), 2. classic user (who pays by cheque), 3. evolving user (who pays by
cheque and bancomat), 4. modern user (who pays by cheque, bancomat and credit
card). These groups define the categories of a multiple response variable. The
following variables with their categories are considered as predictors: title
of study (1. none, 2. primary school, 3. middle school, 4. secondary school,
5. degree); age (1. lower than 35 years, 2. 35 - 45 years, 3. 46 - 55 years,
4. 56 - 65 years, 5. over 66 years); area of residence (1. North-West,
2. North-East, 3. Center, 4. South, 5. Isle); city population size (1. lower
than 20 thousand citizens, 2. 20 - 40 thousand citizens, 3. 40 - 50 thousand
citizens, 4. over 50 thousand citizens); family income (1. lower than 30
million Italian lire, 2. 30 - 45 million Italian lire, 3. over 45 million
Italian lire).

The latent budget tree obtained by performing the proposed procedure is shown
in figure 1, where the boxes denote the terminal nodes with the class labels
below. The model selected at each node as fitting the data always has K=2
latent classes, thus providing a binary tree. Tables 1-5 describe the
estimates of both the mixing parameters (the left block matrix A') and the
latent budgets (the right block matrix B'). For each latent class (defining
the sub-group of cases) the circled estimates of the mixing parameters (with
value not lower than 0.5) indicate which predictor categories cases must
belong to in order to fall in that sub-group. The estimates of the latent
budget parameters (the posterior probability estimates) are compared with the
independence model (the prior probability estimates) shown in the last row of
the right block matrix. Cases belonging to a given latent budget have a high
probability of falling into the response classes associated with the circled
estimates of the latent budget parameters.

For example, the city population size was selected as predictor for
partitioning the 850 cases of node 1, with model fit $G^2$=1.7 (df=4). Table 1
shows that the predictor categories are divided into two sub-groups, bringing
categories 1, 2, 3 to the left (circled estimates for k=1) and category 4 to
the right (circled estimate for k=2). The partition of cases is induced by
this split of the predictor categories: families living in cities with 50
thousand citizens or fewer go to the left subnode, whereas families living in
large cities with more than 50 thousand citizens go to the right subnode. The
left sub-group has a high probability of being assigned to the response
classes 1 and 2, whereas the right sub-group has a high probability of being
assigned to the response classes 3 and 4.

Table 6 reports the posterior probability estimates for each terminal node:
for each terminal node, the value in bold identifies the final response class.

Table 1: Latent budget model at node 1: $G^2$=1.7 (df=4), AIC*=-6.3.
Circled estimates are shown in parentheses; a "?" marks a digit that is
illegible in the source.

node 1    City population size              Typologies of users
          i=1     i=2     i=3     i=4       j=1     j=2     j=3     j=4      total
k=1      (1.00)  (0.75)  (0.82)   0.00     (0.32)  (0.28)   0.25    0.15     1.00
k=2       0.00    0.25    0.18   (1.00)     0.20    0.13   (0.2?)  (0.4?)    1.00
total     1.00    1.00    1.00    1.00      0.30    0.24    0.26    0.20     1.00


Table 2: Latent budget model at node 2: $G^2$=5.7 (df=6), AIC*=-6.3.



node 2 


Area of residence 


Typologies of users 






i=l 






i=4 


i=5 


msm 


msm 


msm 


msm 


total 


k=l 






MEM 




KR9 






0.04 


0.08 


1.00 


k=2 


CEESS 




ISlilISfifl 


0.33 


0.26 


0.18 


0.25 






1.00 


total 


1.00 


1.00 


1.00 


1.00 


1.00 


0.30 


0.25 


0.26 


0.19 


1.00 


Table 3: Latent budget model at node 5: $G^2$=0.6 (df=2), AIC*=-3.4.



node 5


Area of residence 


Typologies of users 








i=l 


ma 


i=4 


msm 


msm 


msm 


■51 


total 


k=l 


0.00 


_ml 


cm 


KBS 


EM 


0.21 


0.18 


1.00 


k=2 






0.00 


0.17 


0.24 


EM 




1.00 


total 


1.00 


1.00 


1.00 


0.21 


0.26 


0.32 


0.22 


1.00 



Table 4: Latent budget model at node 10: $G^2$=3.9 (df=6), AIC*=-8.4.



node 10 


Title of study 


Typologies of users 






i=l 


m 




i=4 


i=5 


msm 




msm 


mem 


total 


k=l 


0^0 


0.09 




KBS 




0.00 


0.22 


K^ 


EM 


1.00 


k=2 




Mm 


0.47 


0.19 


0.00 


KB9 


CM 


0.14 


0.00 


1.00 


total 


1.00 


1.00 


1.00 


1.00 


1.00 


0.31 


0.29 


0.22 


0.18 


1.00 



Table 5: Latent budget model at node 11: $G^2$=8.1 (df=6), AIC*=-3.8.



model 1 



k=l 

k=2 

total 



i=l 

Co.94) 

0.06 

1.00 



i=2 

0.06 

1.00 



Age 

i=3 

0.00 

1.00 



i=4 

0.47 

(0^5^ 

1.00 



i=5 

0.00 

1.00 



Typologies of users 



1=1 i=2 



0.06 

0.18 



0.21 

CQ.29)| 

0.24 



ok:. 

0.18 

0.35 



m 

0.13 

0.23 



total 

1.00 

1.00 

1.00 



5. Concluding remarks

This paper has provided a methodology to grow a classification tree for a
multiple qualitative response variable on the basis of the latent budget
model. On the one hand, the recursive use of the latent budget model has made
it possible to extend latent budget analysis to multidimensional contingency
tables; on the other hand, the tree-growing procedure via latent budget
analysis has characterized an alternative tree predictor built up by using the
Bayesian concepts underlying the latent budget parameters. It was beyond the
aim of this paper to show the relations of the proposed method with other
approaches to classification with prior knowledge of a criterion variable,
such as supervised neural networks (Mola, Davino, Siciliano and Vistocco,
1997) and other tree-structured procedures (Mola and Siciliano, 1997;
Siciliano and Mola, 1997). As main advantages resulting from the application
to real data sets, the proposed approach has proved to solve the multi-class
problem of classification trees by improving the classification over all the
response classes; it provides a fruitful interpretation of the tree in terms
of conditional probabilities and the Bayes rule; it deals very easily with
multiple questions through the use of compound variables as predictors in the
model fitted at each node.

Acknowledgements: This research project was supported by MURST 60%. 



Abstract: A review of methods for asymmetric three-way scaling is presented,
focusing on their graphical capabilities. A general strategy of analysis is
outlined with an example of application to import-export data.

Keywords: Three-way Data Matrices, Asymmetric Scaling, Graphical Methods. 



1. Introduction 

Square data matrices whose rows and columns correspond to the same set of
"objects" occur frequently in applications: proximities (e.g. similarity
ratings), preferences (e.g. sociomatrices), flow data (e.g. import-export,
brand switching) and contingency tables (e.g. occupational mobility, word
associations) are known examples. Asymmetry that is not completely random is
often observed in such matrices, and models taking this feature into account
are required. To this purpose, multilinear and distance models are suitably
modified by increasing the number of parameters in order to represent the
asymmetry. A review of possible methods was recently provided in Zielman &
Heiser (1996) for the two-way case. In this paper some methods for the
simultaneous analysis of k asymmetric square n x n matrices (three-way case)
are discussed, focusing on their relationships and graphical capabilities.
This will suggest a proposal for a possible strategy of analysis based on the
trade-off existing between fit and parsimony of the model. An example of
application of the strategy is given by using import-export data.

In what follows the model formulation is specified for metric data; when a
model can also deal with the non-metric case, this fact will be mentioned.
Data are assumed to have been processed in order to make the matrices coherent
with the different models (similarities for multilinear models or
dissimilarities for distance-like models).



* This research has been made possible by the National Research Council of
Italy with grant 96.01350.CT10 for the first author and 97.01192.CT10 for the
second author.



2. Methods for intrinsic asymmetry

When relationships are mainly directional (intrinsic asymmetry), each datum is
interpreted as a direct estimate of a single relation and less attention is
paid to the symmetric component (e.g. journal citation data).

Most of the multilinear models proposed to deal with this asymmetry are
particular cases of the following general formulation

$$\omega_{ijh} = x_i' R_h y_j + \varepsilon_{ijh} \qquad (1)$$

where $x_i$ and $y_j$ are q x 1 (q < n) vectors of loadings respectively for
row-object i and column-object j in q dimensions (or “aspects”), $R_h$ is a
square matrix of order q, representing the underlying asymmetric relationships
among the aspects at occasion h, and $\varepsilon_{ijh}$ is the error term.
Two sets of latent components (aspects) are given for the n objects, when
considered as rows and when considered as columns. The relationships
summarised in $R_h$ between the two sets can change across the occasions. The
parameters are computed by minimising the sum of squared residuals; an
alternating least squares algorithm can be obtained by the TUCKER-2 method
(Kroonenberg & de Leeuw, 1980). Even if the asymmetric relationships in the
data are usually simply represented in the $R_h$ matrices, model (1) has some
drawbacks: components for rows and columns are different; it does not provide
a graphical representation of the objects; it is difficult to deal with
non-metric data. In the following we list and comment on some particular cases
of model (1) known in the literature; in the equations below R is a square
matrix and $D_h$ is diagonal.



PARAFAC-CANDECOMP
(Harshman 1970, Carroll & Chang, 1970)           $R_h = D_h \ge 0$            (2)

Three-way DEDICOM
(Harshman 1978, Kiers 1989)                      $x_i = y_i$                  (3)

Three-way dual domain ("weak") DEDICOM
(Harshman 1978)                                  $R_h = D_h R D_h$            (4)

Three-way single domain ("strong") DEDICOM
(Harshman 1978)                                  $R_h = D_h R D_h$, $x_i = y_i$   (5)



Models (3)-(5) simplify the factorial interpretation in different ways: with
(3) we allow only one set of components, with asymmetric relations changing
across the occasions; model (4) allows separate sets of components varying
proportionally across the occasions and having the same underlying asymmetric
structure. Model (2) is the only one with graphical capabilities: each object
is represented by two points in a q-dimensional space in order to represent
the two different directions $\omega_{ijh}$ and $\omega_{jih}$ by scalar
products. On each occasion the q dimensions are weighted according to the
diagonal entries of $D_h$. It follows that a representation for the
occasion-weights can also be obtained.
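
To make the structure of model (1) and of its special cases concrete, the
following Python sketch evaluates the systematic part $x_i' R_h y_j$ for given
parameters. It is purely illustrative: the helper names and the numerical
values are hypothetical, and no fitting algorithm is implemented.

```python
def bilinear(x_i, R_h, y_j):
    """Systematic part x_i' R_h y_j of model (1)."""
    q = len(x_i)
    return sum(x_i[a] * R_h[a][b] * y_j[b] for a in range(q) for b in range(q))


def diag(d):
    """Diagonal matrix with the entries of d on the main diagonal."""
    return [[d[a] if a == b else 0.0 for b in range(len(d))]
            for a in range(len(d))]


def matmul(A, B):
    """Product of two square matrices of the same order."""
    n = len(A)
    return [[sum(A[a][c] * B[c][b] for c in range(n)) for b in range(n)]
            for a in range(n)]


# Hypothetical loadings and weights in q = 2 dimensions.
x_i, x_j = [1.0, 2.0], [3.0, 1.0]
R = [[0.0, 1.0], [0.0, 0.0]]           # asymmetric relations among aspects
D_h = diag([2.0, 1.0])                 # occasion weights

# "Strong" DEDICOM: R_h = D_h R D_h and the same loadings for rows and columns.
R_h = matmul(matmul(D_h, R), D_h)
omega_ij = bilinear(x_i, R_h, x_j)     # relation from i to j
omega_ji = bilinear(x_j, R_h, x_i)     # relation from j to i

# PARAFAC-CANDECOMP: R_h = D_h, with separate row and column loadings.
y_j = [3.0, 1.0]
parafac_ij = bilinear(x_i, D_h, y_j)
```

In the single domain case the asymmetry comes entirely from R, whereas in the
PARAFAC case it comes from the use of two different sets of loadings.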

In the multidimensional scaling (MDS) tradition the distance-like models
proposed for intrinsic asymmetry are particular cases of the General Euclidean
Model (GEM, Young 1987), which in scalar notation can be expressed as

$$\omega_{ijh} = (x_i - y_j)' V_i W_h (x_i - y_j) + \varepsilon_{ijh} \qquad (6)$$

where $x_i$ and $y_j$ are q-dimensional vectors of coordinates respectively
for row i and column j, $V_i$ is a q x q positive semi-definite (psd) matrix
of weights associated with row object i, $W_h$ is a q x q psd matrix of
weights associated with occasion h and $\varepsilon_{ijh}$ is the
approximation error. The particular cases of model (6) more relevant for
analysing asymmetry are:

ASINDSCAL                          $x_i = y_i$, $V_i = \mathrm{diag}(V_i)$, $W_h = \mathrm{diag}(W_h)$   (7)

Weighted Multidim. Unfolding       $V_i = I$, $W_h = \mathrm{diag}(W_h)$                        (8)

Both models provide graphical representations for the objects and the
occasions and can deal with non-metric and missing data. However, while the
modifications produced by the weights $W_h$ for the q dimensions in model (8)
are easily interpretable (as in the INDSCAL model), the interpretation of
results for model (7) seems controversial because of the non-Euclidean
geometry induced by the introduction of the diagonal matrices $V_i$. In fact,
in spite of the wide availability of software (e.g. the ALSCAL procedure in
SPSS), ASINDSCAL has rarely been used.



3. Methods for the analysis of symmetry and skew-symmetry 

The decomposition of any square matrix into symmetric and skew-symmetric
components has inspired many proposals for two-way asymmetric MDS. Some of
these methods were reviewed in Zielman & Heiser (1996). The three-way case has
not been so widely explored. We describe two methods for representing the
symmetric and the skew-symmetric components separately and one providing their
simultaneous analysis.
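
The decomposition itself is elementary and can be sketched as follows for one
occasion; the function name and the small example matrix are hypothetical.

```python
def decompose(omega):
    """Split a square matrix into its symmetric part (omega + omega')/2
    and its skew-symmetric part (omega - omega')/2."""
    n = len(omega)
    sym = [[(omega[i][j] + omega[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(omega[i][j] - omega[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew


def sum_of_squares(m):
    """Sum of the squared entries of a matrix."""
    return sum(v * v for row in m for v in row)


# Hypothetical 2 x 2 flow matrix for a single occasion.
omega = [[0.0, 3.0], [1.0, 0.0]]
sym, skew = decompose(omega)
```

The two parts are orthogonal, so the sum of squares of the data is the sum of
the sums of squares of the symmetric and the skew-symmetric components; this
is what makes it meaningful to report the fit of the two parts separately.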

A method for the first case is provided by Zielman (1991). It is based on the
model

$$s_{ijh} = \tfrac{1}{2}(\omega_{ijh} + \omega_{jih}) = \left[(x_i - x_j)' W_h (x_i - x_j)\right]^{1/2} + \varepsilon^{s}_{ijh} \qquad (9a)$$

$$k_{ijh} = \tfrac{1}{2}(\omega_{ijh} - \omega_{jih}) = u_h'(r_i - r_j) + \varepsilon^{k}_{ijh} \qquad (9b)$$

where:

$s_{ijh}$ and $k_{ijh}$ are the symmetric and the skew-symmetric components of
the data;

$x_i$, $r_i$ are coordinate vectors of object i;

$W_h$ is a psd matrix which can be diagonal (INDSCAL) or not (IDIOSCAL);

$u_h$ indicates the relative importance of each dimension in occasion h.

For this model the symmetric component is represented by the distances in a
weighted Euclidean space. The skew-symmetry is depicted in a separate
multidimensional space in which $r_i$ describes the location of object i while
a vector represents occasion h. The projections of the object-points on the
occasion-vectors indicate the rank-order of the objects for that particular
occasion (as in the vector model of preference data, Carroll 1972). The method
allows missing values. A linear model for asymmetry at each occasion is
assumed in model (9b); this could be relaxed, allowing a multidimensional
representation of skew-symmetry by using the PARAFAC model, but the properties
of this approach need further investigation.

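Finally, a proposal for the simultaneous analysis of the symmetric and the
skew-symmetric components is considered in Rocci & Bove (1994). The elementary
datum is approximated as

$$\omega_{ijh} = d_{ih}\, d_{jh}\, x_i' J x_j + \varepsilon_{ijh}, \qquad J = \mathrm{diag}\left( \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}, \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}, \ldots \right) \qquad (10)$$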



For each pair of dimensions, (1,2), (3,4), ..., a “compromise” representation
of the objects in a plane (bimension) is obtained. The scalar product between
pairs of object-points represents symmetry, while twice the area of the
triangle they form with the origin describes the skew-symmetry; the algebraic
sign is associated with the orientation of the plane (positive
counter-clockwise, negative clockwise). Object weights make it possible to
display the different occasions by stretching or shrinking the distance of
each point from the origin in the compromise configuration. The total amount
of symmetry (skew-symmetry) reconstructed by the model is obtained by summing
algebraically the scalar products (triangle areas) in each bimension.
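
The geometric reading of one bimension can be checked numerically. The sketch
below assumes that each 2 x 2 block of J has rows (1, 1) and (-1, 1), so that
$x_i' J x_j$ is the scalar product plus twice the signed area of the triangle
formed with the origin; the function name and the coordinates are
hypothetical.

```python
def bimension_terms(x_i, x_j, d_i=1.0, d_j=1.0):
    """Return (value, symmetric part, skew-symmetric part) for one bimension.

    x_i, x_j : planar coordinates of objects i and j.
    d_i, d_j : object weights for the occasion considered.
    """
    scalar = x_i[0] * x_j[0] + x_i[1] * x_j[1]       # scalar product
    twice_area = x_i[0] * x_j[1] - x_i[1] * x_j[0]   # signed, counter-clockwise > 0
    w = d_i * d_j
    return w * (scalar + twice_area), w * scalar, w * twice_area


# Hypothetical points: j lies counter-clockwise from i.
value_ij, sym_ij, skew_ij = bimension_terms((1.0, 0.0), (0.0, 1.0))
value_ji, sym_ji, skew_ji = bimension_terms((0.0, 1.0), (1.0, 0.0))
```

Swapping the two objects leaves the scalar product unchanged and reverses the
sign of the area term, which is exactly the symmetric and skew-symmetric
behaviour described above.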



4. A general strategy of analysis 

In the previous sections we have shown that multilinear and distance-like 
methods are hierarchically related. As a matter of fact, they are nested models 
and optimise the same criterion function, but those lower in the hierarchy 
optimise this function subject to more severe constraints than those higher in the 
hierarchy, which results in poorer model fits. On the other hand, the more 
severely constrained methods correspond to simpler models, yielding easier 
interpretations. 

Distance-like and multilinear methods can be related by noting that models (2)
and (8) are equivalent, because the first is the scalar product version of the
latter. Model (10), under the constraint $d_{ih} = d_h$, can be seen as a
constrained version of (5), where $R = J$ and $D_h = d_h I$, or as a
particular case of (2), where $y_j = J x_j$ and $D_h = d_h^2 I$. Finally, we
can summarise the relations among the different methods with the following
directed graph




where an arrow goes from a model to a nested one. Some suggestions for
possible strategies of analysis can be deduced from the aforementioned
hierarchy. For example, in both contexts, multilinear and distance-like, the
most parsimonious model, possibly one with graphical capabilities, could be
taken as the starting point, moving to more complex models until an acceptable
level of fit is reached.



5. An application to import-export data 

Some of the methods previously discussed are now applied to import-export
data, following the indications of the outlined strategy. The flows in US dollars
(O.C.S.E.) in the three years 1981, 1985 and 1989 are analysed for the following 12
European countries: Belgium (BE), Denmark (DE), Finland (FI), France (FR),
Germany (GE), Ireland (IR), Italy (IT), Netherlands (NL), Norway (NO), Spain
(SP), Sweden (SW), United Kingdom (UK). The entries of the data matrices
represent the exports from country i to country j in year h. The
iterative proportional fitting algorithm was preliminarily applied to each matrix
(column and row sums equal to one) in order to remove the influence of
country size and to emphasise the row-column association, that is the factors of
interchangeability. The data in each matrix were also centred with respect to the
overall mean to remove the trivial component. Thus the exports from one country
to another are evaluated with respect to the overall mean exports. The
diagonal entries of the data matrices were not taken into account because they are
meaningless.
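
The preprocessing just described can be sketched as follows. This is an illustrative sketch under stated assumptions, not the routine actually used: the flow matrix is randomly generated (hypothetical data), the diagonal is ignored by setting it to zero before the fitting, and the function names are ours.

```python
import numpy as np

def ipf_unit_margins(flows, n_iter=1000, tol=1e-12):
    """Alternately rescale rows and columns until all margins equal one."""
    m = np.array(flows, dtype=float)
    np.fill_diagonal(m, 0.0)  # assumption: diagonal flows are ignored
    for _ in range(n_iter):
        m /= m.sum(axis=1, keepdims=True)  # row sums become one
        m /= m.sum(axis=0, keepdims=True)  # column sums become one
        if np.abs(m.sum(axis=1) - 1.0).max() < tol:
            break
    return m

def centre_off_diagonal(m):
    """Subtract the overall mean of the off-diagonal entries."""
    off = ~np.eye(m.shape[0], dtype=bool)
    centred = np.zeros_like(m)
    centred[off] = m[off] - m[off].mean()
    return centred

rng = np.random.default_rng(0)
flows = rng.uniform(1.0, 10.0, size=(5, 5))  # hypothetical flows, one year
fitted = ipf_unit_margins(flows)
centred = centre_off_diagonal(fitted)
print(np.round(centred, 3))
```

After the fitting, positive entries of the centred matrix correspond to exchanges above the overall mean and negative entries to exchanges below it, regardless of the size of the countries involved.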

We first applied method (10), under the constraint considered in Section 4,
because of its graphical properties and because it requires a minimum number of
parameters. At first only one bimension was fitted, to represent symmetry and
skew-symmetry simultaneously by using only n points. Unfortunately the model fit
was very poor (about 50% of the total sum of squares) and we needed two bimensions
to improve the fit up to 74%. This implies two planar representations of n points
each, with a doubled number of parameters. Moreover, algebraic sums of areas and
scalar products from the two different bimensions are required, which makes the
graphical interpretation difficult. We could circumvent this inconvenience by
adopting a suitable rotation in the multidimensional space (see Rocci and
Bove, 1998), but here we show what was obtained by changing the model.



Figure 1: PARAFAC-CANDECOMP

The PARAFAC-CANDECOMP model (2) was chosen at this stage for its graphical
capabilities. The resulting two-dimensional configuration is depicted in Fig. 1; it
reproduces 76% of the total sum of squares. Three groups of countries are
clearly isolated (DE-FI-NO-SW; IR-UK; BE-FR-GE-IT-NL-SP), and we notice
that the within-group relationships are usually higher than the overall mean
exchange in the data matrices (positive scalar products), while the between-group
relationships are lower than or equal to it (negative or null scalar products).
Each centred oriented flow is represented in the display: for instance, the centred
flow from UK to IR is approximated by the scalar product between the points UK(+)
and IR(0), while the opposite flow can be analysed similarly through IR(+) and
UK(0). From their comparison it follows that the skew-symmetry is favourable to IR.
Moreover, the mean exchange can also be considered by averaging the two scalar
products. Of course, the graphical analysis of the symmetric and the
skew-symmetric components is easier with the methods explicitly based on this
decomposition, while, on the contrary, with them it is more difficult to detect the
oriented flows. The configurations obtainable for the three years by applying the
corresponding weights are very similar to Fig. 1, owing to the general stability of
the import-export phenomenon over time.




At this point we decided to apply a distance model of the GEM class (6) in order
to simplify the interpretation of the diagram (it is usually easier to read a
distance than a scalar product). After subtracting each entry from the maximum in
every occasion, the Unfolding model was chosen, since it is substantially the
reformulation of PARAFAC-CANDECOMP in the domain of distance models.
Furthermore, it is much easier to interpret than ASINDSCAL, in spite of
their equivalence in terms of number of parameters. The fit obtained was 62% of
the total sum of squares and the configuration is depicted in Fig. 2. The
interpretation is now based on distances between points rather than on scalar
products; capital letters are associated with the rows and small letters with the
columns. The results are similar to Fig. 1 in terms of between-group relationships,
with some minor differences within the groups (e.g. the UK-IR skew-symmetry is
less evident in Fig. 2).

Finally, we point out that when a good approximation cannot be obtained with the
previous models, an approach representing the symmetric and the
skew-symmetric components separately can be conveniently applied.

Abstract: The different techniques used for the Euclidean approximation of
distances are discussed. In the special case of points in a Euclidean space whose
distances are biased by measurement errors, accepting negative eigenvalues may
help in the interpretation of results, which are less biased than those obtained
through an additive constant.

1. Introduction: PCA and PCoA



In the geometrical representations of exploratory data analysis, individuals are
seen as points in an affine space E, supported by the vector space spanned by the
observed variables. If no metric is known, the variables may be arbitrarily
represented in a graphical Euclidean space and the individuals are placed
accordingly. If some measured association exists, reflecting the relationships
among either individuals or variables, it may be used as a spatial metric,
according to its properties. If the association matrix is positive semi-definite
(psd), either Principal Components Analysis (PCA) or Principal Coordinates
Analysis (PCoA) gives an exact Euclidean representation, where distances between
points are interpreted as dissimilarities among the corresponding individuals and
angles between vectors as measures of the reciprocal agreement (similarity) of the
variables. In this frame, reduced dimensional representations are effective for
synthesising information and detecting the main factors.

When negative eigenvalues occur, i.e. when the available association matrix is not
psd, the Euclidean representation becomes critical, since negative eigenvalues may
not be interpreted as the inertia of the points along the corresponding
eigenvectors. Nevertheless, a Euclidean approximation is always possible through
different techniques.

A similarity matrix S among variables is a symmetric bilinear form having its
maxima on the diagonal; if S is psd it induces a scalar product, which makes the
space E spanned by the variables a Euclidean space. Thus (Lang, 1972) orthonormal
bases may be formed, and the eigenanalysis $S = U \Lambda U'$, with constraint
$U'U = I$, once the eigenvalues are sorted in descending order, solves the
optimization problem

$\left\| S - \sum_{\alpha=1}^{q} \lambda_\alpha u_\alpha u_\alpha' \right\|^2 = \text{minimum} \quad \forall\, q < p$    (1)
where p is the order of S, $\lambda_\alpha \in \Lambda$ and $u_\alpha \in U$. The
loss of information is minimised by projecting both vectors and points onto the
reduced dimensional spaces spanned by the first q eigenvectors of S. It may be
measured, since $\sum_\alpha \lambda_\alpha = \mathrm{trace}(S)$ and, given the
$n \times p$ data table X concerning p variables measured on the n individuals,
$\lambda_\alpha$ is the inertia of the points along the $\alpha$-th axis,
$c_{i\alpha} \in C = XU$ being the $\alpha$-th coordinate of the i-th individual in
the basis formed by the eigenvectors. Then the ratio
$\lambda_\alpha / \mathrm{trace}(S)$ is a relative measure of the inertia along the
$\alpha$-th axis. These are the essentials of Principal Components Analysis.

Given an $n \times n$ distance matrix D among individuals, Torgerson's (1958)
formula

$s_{ij} = -\frac{1}{2}\left( d_{ij}^2 - d_{i\cdot}^2 - d_{\cdot j}^2 + d_{\cdot\cdot}^2 \right)$    (2)

for all $i, j = 1, \dots, n$, where $d_{i\cdot}^2$, $d_{\cdot j}^2$ and
$d_{\cdot\cdot}^2$ denote the row, column and overall means of the squared
distances, defines a similarity between the vectors associated with the considered
points, whose origin is at the centroid of the points. If the corresponding matrix
S is psd, the distance is Euclidean and S may be used as a scalar product, having
D itself as the distance among points. S may be submitted to Principal Coordinates
Analysis (Torgerson, 1958; Gower, 1966; Mardia et al., 1979); the eigenanalysis of
S under the constraint $U'U = \Lambda$ provides an orthogonal basis with the same
meaning as in PCA.


2. Theoretical PCoA and distance limits, with solutions 



Although it partitions the inertia of the points, PCoA does not lead to the best
least squares approximation of distances in reduced dimensional spaces, since the
problem is now

$\sum_{i,j} \left( d_{ij}^2 - \sum_{\alpha=1}^{q} (u_{\alpha i} - u_{\alpha j})^2 \right)^2 = \text{minimum} \quad \forall\, q < n$    (3)

which may not be solved through eigenanalysis, nor are the solutions in spaces of
different dimension nested as in the scalar product case (Le Calve, 1976). Thus,
the application of PCoA should be limited to cases in which, far from aiming at
minimizing the bias of distances, one wishes to use the available distances to
detect angles among vectors, in agreement with the given distances. When the aim
is the best approximation of distances in reduced dimensional spaces, numerical
techniques (such as Non-Metric Multidimensional Scaling, NMMDS; Kruskal, 1964a, b)
lead to local optima. With NMMDS all existing metric information, such as factors
and inertia of the points, if any, is lost, since it deals with the ranks of
dissimilarities. Both Gower (1966) and Seber (1984) suggest the use of PCoA as an
initial guess for NMMDS, but Kruskal (quoted by Seber, 1984) admits that it does
not change the initial configuration very much. One may then question the
convenience of such an attempt, particularly in exploratory analyses, when a
metric exists.

If the given distance is not Euclidean, the matrix is not psd, so that negative
eigenvalues result from the eigenanalysis. This is usually considered a serious
drawback for the interpretation of eigenvalues in terms of explained inertia; yet
the interest in a Euclidean approximation remains, at least for descriptive
purposes. One may skip the problem by using NMMDS, or override it by adding a
suitable constant to all distances (Lingoes, 1971; Cailliez, 1983). In particular,
Lingoes' (1971) solution, given by $d_{ij}^2 + c$ with $c = 2|\lambda_n|$, where
$\lambda_n$ is the negative eigenvalue with maximum modulus, biases the pattern of
individuals. No theoretical improvement derives from Cailliez's (1983) exact
solution $d_{ij} + c$, where c is the largest eigenvalue of a particular matrix.
Since all distances are biased, Messick and Abelson (1956), Gower (1966), Saito
(1978), Mardia (1978), and Critchley (1980) argue that Torgerson's solution,
ignoring negative eigenvalues, is to be preferred, provided that they are small.
In addition, Mardia (1978) shows that Torgerson's solution limited to positive
eigenvalues has some optimal properties. The additive constant technique then
seems a suitable alternative to NMMDS, with the advantage that the algorithm is
faster and leads to a global optimum instead of a local one.
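
A minimal sketch of Lingoes' correction follows. The four-point distance matrix is a hypothetical example (a centre at distance 1 from three points that are at distance 2 from each other, which cannot be embedded in a Euclidean space), and the function names are ours.

```python
import numpy as np

def torgerson(dist):
    """Double-centred matrix S obtained from a distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * centring @ d2 @ centring

def lingoes_constant(dist):
    """c = 2 |lambda_n|, with lambda_n the most negative eigenvalue of S."""
    lam_min = np.linalg.eigvalsh(torgerson(dist)).min()
    return max(0.0, -2.0 * lam_min)

def add_to_squared_distances(dist, c):
    """Add c to every off-diagonal squared distance."""
    d2 = np.asarray(dist, dtype=float) ** 2 + c
    np.fill_diagonal(d2, 0.0)
    return np.sqrt(d2)

# Hypothetical non-Euclidean distances: point 0 is the centre.
dist = np.array([[0.0, 1.0, 1.0, 1.0],
                 [1.0, 0.0, 2.0, 2.0],
                 [1.0, 2.0, 0.0, 2.0],
                 [1.0, 2.0, 2.0, 0.0]])

c = lingoes_constant(dist)
corrected = add_to_squared_distances(dist, c)
print(c, np.round(np.linalg.eigvalsh(torgerson(corrected)), 6))
```

Adding c to the squared distances shifts every eigenvalue associated with a centred eigenvector by c/2, so the most negative eigenvalue becomes zero, while all distances, including those that were not responsible for the problem, are modified.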



3. A special case and a suggested solution 

Empirical data are always subject to both measurement error and rounding, so that
a distance matrix may be only an estimate of the true one. In the following, we
compare the additive constant method with a closer look at the Euclidean
approximation in the case of negative eigenvalues. This will be done in the
special case in which a) the actual pattern of individuals is supposed to belong to
a Euclidean space, b) factors accountable for the scattering of individuals are
supposed to exist, and c) distances are non-Euclidean due to measurement errors. A
robust method of analysis, not too dependent on these errors, would be precious.
In this case, the application of NMMDS would lose all metric information, and the
axes and inertia given by the additive constant technique would not be directly
related to the original pattern, since distances would be biased.
In particular, the original true inertia would be overestimated.

Let us consider the eigenanalysis of a non-psd matrix, with the eigenvectors
constraint being

$u_\alpha' u_\beta = \begin{cases} 1 & \text{if } \alpha = \beta \text{ and } \lambda_\alpha > 0 \\ -1 & \text{if } \alpha = \beta \text{ and } \lambda_\alpha < 0 \\ 0 & \text{otherwise} \end{cases}$    (4)

and call it, for simplicity, plain analysis. Thus, the eigenvectors corresponding
to negative eigenvalues are imaginary, as well as the coordinates of individuals
(Gower, 1985). Nevertheless, the formula $\mathrm{trace}(\Lambda) = \mathrm{trace}(S)$
is still valid, but the trace is partitioned into two parts: 1) a positive one,
that overestimates the true inertia, with a real Euclidean representation, i.e. a
Euclidean approximation of individuals on the space spanned by the real
eigenvectors; and 2) a negative one, that fixes the overestimated part and
provides an imaginary representation (on the imaginary eigenvectors) of the bias
introduced in the Euclidean approximation. Limited to this imaginary eigenspace,
coordinates, contributions, distances, and inertia keep the same meaning as in the
real representation, with the difference that they refer to the introduced bias.
It is then a principal coordinates analysis of the introduced bias.

Considering a distance matrix D, if the corresponding Torgerson's matrix S is not
psd, the equivalent tool is the eigenanalysis of S with constraint
$U'U = \Lambda$, including negative eigenvalues. In order to give a geometrical
meaning to this technique, one must consider that in Euclidean spaces
non-Euclidean geometric figures «do not close». In order to enable the
representation of points, thus to close the figures, some distances need to be
enlarged. Consequently, the original inertia is increased and becomes
overestimated. The added inertia does not exist in the data, nor may it appear in
$\mathrm{trace}(\Lambda) = \mathrm{trace}(S)$. Negative eigenvalues and imaginary
axes are then necessary for the inertia balance; having a negative contribution,
they explain where and how much extra inertia was added in order to allow the
Euclidean representation.
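
The inertia balance of plain analysis can be sketched numerically. In the sketch below, which relies on hypothetical function names and on a hypothetical four-point non-Euclidean example, the imaginary coordinates are stored through their real magnitudes, so that their squared differences are subtracted rather than added when distances are rebuilt.

```python
import numpy as np

def torgerson(dist):
    """Double-centred matrix S obtained from a distance matrix."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * centring @ d2 @ centring

def plain_analysis(dist, tol=1e-10):
    """Eigenvalues plus coordinates on the real and on the imaginary axes."""
    eigval, eigvec = np.linalg.eigh(torgerson(dist))
    pos, neg = eigval > tol, eigval < -tol
    real_coords = eigvec[:, pos] * np.sqrt(eigval[pos])
    imag_coords = eigvec[:, neg] * np.sqrt(-eigval[neg])  # magnitudes of i*c
    return eigval, real_coords, imag_coords

def squared_distances(real_coords, imag_coords):
    """Real axes add squared differences, imaginary axes subtract them."""
    def sq(c):
        diff = c[:, None, :] - c[None, :, :]
        return (diff ** 2).sum(axis=-1)
    return sq(real_coords) - sq(imag_coords)

# Hypothetical non-Euclidean distances: point 0 is at distance 1 from three
# points that are at distance 2 from each other.
dist = np.array([[0.0, 1.0, 1.0, 1.0],
                 [1.0, 0.0, 2.0, 2.0],
                 [1.0, 2.0, 0.0, 2.0],
                 [1.0, 2.0, 2.0, 0.0]])

eigval, real_coords, imag_coords = plain_analysis(dist)
positive_part = eigval[eigval > 0].sum()   # overestimated inertia
negative_part = eigval[eigval < 0].sum()   # correction on imaginary axes
rebuilt = squared_distances(real_coords, imag_coords)
print(positive_part, negative_part, np.trace(torgerson(dist)))
```

In this example the real axes alone enlarge some distances, and subtracting the contribution of the imaginary axis restores the original squared distances exactly.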



4. Numerical examples 

Let us start with a very simple example, considering a square of unit area with
all vertices lying on the axes. Their coordinates are thus either 0 or
$\pm\sqrt{2}/2$, each side length is 1, and the length of the diagonals is
$\sqrt{2}$. Giving the computer $\sqrt{2}$ as a double precision number, the
eigenvalues of the corresponding Torgerson matrix are correct and both the
coordinates of the points and the interpoint distances are sufficiently well
approximated. Giving the computer $\sqrt{2}$ with only 4 significant digits, say
either 1.414 or 1.415, a third non-zero eigenvalue is obtained, negative in the
latter case. Tab. 1 compares the essential results of PCoA on both the exact
distance matrix and the one with rounding to 1.414, and of both plain analysis and
PCoA with additive constant on the distance matrix with rounding to 1.415. The
first column gives the distance of each point from the centroid on the plane of the
first two axes; the second the coordinates along the third axis, if it exists; the
third the trace of Torgerson's matrix; the fourth the inertia of the points on the
plane of the first two eigenvectors; the fifth the bias introduced in the side
measures; and the last the one introduced in the length of the diagonals. In case
of rounding to 1.414, the points connected by a diagonal (1 and 4) are opposed on a
third axis to the others (2 and 3), in order to keep the side length equal to 1,
since otherwise the shortened diagonal would shrink the sides too (Fig. 1).

Table 1: Main results of four points analyses.

Matrix                    distance      III axis      trace    plane 1-2   bias     bias
                          from origin   coordinate             inertia     side     diagonal
√2                        .7071                       2.0000   2.0000
1.414                     .7070         ± .0086       1.9997   1.9994      -.0003   0
1.415 plain analysis      .7075         ± i.0167      2.0022   2.0011      +.0011   0
1.415 additive constant   .7079                       2.0045   2.0045      +.0022   +.0022

In case of rounding to 1.415, the diagonals are too long to allow the side length
1, so that it becomes 1.00055. This biased distance is fixed on the fourth axis, an
imaginary one with negative eigenvalue. Here the squared difference between the
imaginary coordinates of the points is negative and its value is exactly the one
necessary to restore distance 1. Numerically, only the side lengths are biased and
the information is found on the fourth (imaginary) axis. If possible, a graphical
representation would be analogous to that in Fig. 1, with the difference that the
imaginary vertical axis would show which pairs of points had their distance biased
due to the needs of the Euclidean representation. Using the additive constant, all
distances are biased, so that the additive constant representation is more biased
than the plain solution, as Mardia (1978) proved.
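
The effect of rounding the diagonal can be reproduced with a few lines of code. This is a sketch under the assumption, taken from the text, that points 1 and 4 and points 2 and 3 are joined by the diagonals; the function name is hypothetical.

```python
import numpy as np

def square_eigenvalues(diagonal):
    """Eigenvalues (descending) of Torgerson's matrix for the four vertices."""
    d = np.array([[0.0, 1.0, 1.0, diagonal],
                  [1.0, 0.0, diagonal, 1.0],
                  [1.0, diagonal, 0.0, 1.0],
                  [diagonal, 1.0, 1.0, 0.0]])
    centring = np.eye(4) - np.ones((4, 4)) / 4.0
    s = -0.5 * centring @ (d ** 2) @ centring
    return np.sort(np.linalg.eigvalsh(s))[::-1]

results = {}
for label, diagonal in (("exact", np.sqrt(2.0)),
                        ("1.414", 1.414),
                        ("1.415", 1.415)):
    eig = square_eigenvalues(diagonal)
    results[label] = eig
    print(label, np.round(eig, 6), "trace:", round(float(eig.sum()), 6))

# On the plane of the first two axes each vertex lies at distance d/2 from
# the centroid, on two orthogonal axes, so the side becomes d / sqrt(2).
side_on_plane = 1.415 / np.sqrt(2.0)
print("side on plane 1-2:", round(float(side_on_plane), 5))
```

For a diagonal of length d, the non-zero eigenvalues are d²/2 (twice) and 1 − d²/2, so that the third one is positive for 1.414 and negative for 1.415, in agreement with the third-axis coordinates reported in Tab. 1.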

Let us now consider a grid of 16 points in a 2-dimensional space, equally spaced,
so that the side of each grid cell is 2. PCoA returns the exact grid on the
eigenspace corresponding to two equal eigenvalues, which may be arbitrarily
rotated. Modifying the distance between points 1 and 2 to $d_{12} = d_{21} = 2.1$,
the Torgerson matrix associated with D is no longer psd. Lingoes' (1971) solution
adds the same quantity, .4743589..., to each squared distance, and the correct
representation of this biased structure is in 14 axes (i.e. the number of
points - 2), where each biased distance is exactly represented. Tab. 2 shows the
essential bias information: the first column gives the matrix trace, the second
the inertia on the plane of the first two axes, the third and fourth the
coordinates of points 1 and 2 on the third axis, and the following ones the maximum
bias on plane 1-2 of both point positions and distances, considering points 1-2
and the others separately.

In Fig. 2 the grid on the plane spanned by axes 1 and 2 (plane 1-2 in the
following) is shown; the rotation results from PCoA. On this plane, points 1 and 2
are those best represented, as well as their distance compared to all the others.
A third small axis opposes points 1 and 2 (the others remain close to the
centroid), so that in three axes the (biased) distance 1-2 is completely
represented. The other points and distances are adjusted on the following 11 axes,
which are totally useless. Compared to the unbiased solution, all distances are
biased by the same amount.

Using plain analysis, only four non-zero eigenvalues result. In Tab. 2 one may
notice that all bias indicators are lower than those of the additive constant. On
the first two axes, the grid is again well represented, with approximately the
same rotation as with the additive constant. On the third axis, points 1 and 2 are
opposed, and the inertia on three axes overestimates the trace. The imaginary axis
shows this overestimate. In fact, on this axis points 1 and 2 are opposed mostly
to 3, 5, and 6, that is those closest to them, since their distances were the most
biased in order to adjust the biased 1-2 distance. Considering distances, on the
plane 1-2 all those not involving points 1 and 2 are nearly exactly represented,
whereas the others are somewhat underestimated. Summarizing, plain analysis gives
less biased results than the additive constant, in an essential number of axes; on
plane 1-2 a better estimation of both the positions of the points and the true
distances is given; and, in addition, the bias introduced to distances is shown
clearly by the imaginary axis (Fig. 3).

Table 2: Bias measures of grid representation through additive constant technique
and plain analysis.

Matrix              trace      plane 1-2   axis 3 coord.     points bias       distances bias
                               inertia     1        2        1-2      others   1-2      others
Unbiased grid       160.       160.
Additive constant   160.7133   160.1636    .377     -.403    .0083    .005     .3282
Plain analysis      160.0256   159.9386    -.302    .32      .0095    .004     .3338    .1857

Figure 1: Representation of four points on three-dimensional space, when diagonals
are rounded

Figure 2: The 16 points grid on the plane of first two axes, with additive
constant technique.

Figure 3: The 16 points grid on the plane of axes 3 (real) and 16 (imaginary).
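
The grid example can be set up as in the following sketch. The labelling of the points is an assumption: they are numbered row by row, so that points 1 and 2 are two adjacent points of the first row; the function name is hypothetical as well.

```python
import numpy as np

def torgerson(dist):
    """Double-centred matrix S obtained from a distance matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * centring @ (dist ** 2) @ centring

# 4 x 4 grid with spacing 2, numbered row by row (assumed labelling).
coords_1d = 2.0 * np.arange(4)
grid = np.array([(x, y) for y in coords_1d for x in coords_1d])
dist = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=-1)

# Bias the distance between points 1 and 2 (indices 0 and 1).
biased = dist.copy()
biased[0, 1] = biased[1, 0] = 2.1

eig_exact = np.linalg.eigvalsh(torgerson(dist))      # ascending order
eig_biased = np.linalg.eigvalsh(torgerson(biased))
n_nonzero = int(np.sum(np.abs(eig_biased) > 1e-8))
lingoes_c = -2.0 * eig_biased.min()                  # c = 2 |lambda_n|

print("exact trace:", round(float(eig_exact.sum()), 4))
print("non-zero eigenvalues after biasing:", n_nonzero)
print("Lingoes constant:", lingoes_c)
```

Biasing a single distance perturbs Torgerson's matrix by a matrix of rank at most two, which is why plain analysis needs at most four non-zero eigenvalues; if the labelling assumption matches the one used above, the printed constant should be comparable with the quantity quoted for Lingoes' solution.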

Let us now bias two distances, 1-2 and 15-16, to $d_{1,2} = d_{15,16} = 1.1$. In
this case, plain analysis gives 6 non-zero eigenvalues, two of which are negative.
The pattern of points on the plane spanned by the first two axes is nearly the same
as before. On the plane 3-4 (Fig. 4), points 2 and 15 are opposed to points 1 and
16, meaning that all these distances were originally biased. On the imaginary
plane 15-16 (Fig. 5), the opposition of points 1, 2, 15, and 16 to all the others
means that all these distances were biased for representation purposes, in
particular those among points 1, 2 and the closest to them (on the second
quadrant), and those among points 15-16 and the closest to them (on the first). In
this way all the bias introduced in order to fit the Euclidean representation is
visible.

Figure 4: The 16-points grid with two biased distances on the plane of real
axes 3 and 4.

Figure 5: The 16-points grid with two biased distances on the plane of
imaginary axes 15 and 16.



5. Conclusions 

The problem of the Euclidean approximation of points based on an association
matrix, when necessary, may be solved in different ways. Both PCA and PCoA are
exact solutions, provided that the association matrix is psd. Considering distance
matrices, when this is not the case, an approximation may be obtained according to
several techniques. NMMDS is always possible, but the algorithm is time consuming,
dependent on the initial configuration, leads to local optima, and loses metric
information, if present. The additive constant technique is competitive with
NMMDS, since it is based on eigenanalysis, so it is faster and leads to a global
solution, which may differ little from that of NMMDS. Nevertheless, no advantage
comes from its metric structure, since it is biased with respect to the original
one. Critchley (1980) proposes a general formula for non-Euclidean distances,
consisting in transforming distances according to some monotonic function. A psd
matrix is obtained through an additive constant-like transformation that is proved
to have optimal properties. In this frame, Joly and Le Calve (1986) proved that,
given any non-Euclidean distance $d$, there exists a maximum exponent e, called
Euclidean Index, such that the Hadamard power $\alpha$ of matrix D, $D^{\alpha}$,
is psd for every $\alpha$ such that $1/2 \le \alpha \le e < 1$.
The index e may be found by iteration and has optimal properties, since its bias of
the original distances is minimum. In addition, e may be chosen as a good indicator
of the nature of the considered distance, since it informs about the departure of
the distance from Euclideanity. The use of a Euclidean index may lead to more
interesting results, both theoretical and practical; considering distances
obtained through a formula, by including the found index in it, a Euclidean
distance would be obtained directly.
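
The iterative search for the index e can be sketched by bisection. The sketch rests on several assumptions that are ours: the requirement on $D^{\alpha}$ is read as "Torgerson's matrix of the elementwise power is psd", the exponent 1/2 is taken as a safe starting point, and the four-point example and all function names are hypothetical.

```python
import numpy as np

def torgerson(dist):
    """Double-centred matrix S obtained from a distance matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * centring @ (dist ** 2) @ centring

def is_euclidean(dist, tol=1e-10):
    """True when Torgerson's matrix has no eigenvalue below -tol."""
    return np.linalg.eigvalsh(torgerson(dist)).min() >= -tol

def euclidean_index(dist, lo=0.5, hi=1.0, n_iter=60):
    """Largest exponent in [lo, hi] whose Hadamard power is Euclidean."""
    if is_euclidean(dist):
        return 1.0
    if not is_euclidean(dist ** lo):
        raise ValueError("the starting exponent does not give a Euclidean distance")
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if is_euclidean(dist ** mid):
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical non-Euclidean distances: point 0 is at distance 1 from three
# points that are at distance 2 from each other.
dist = np.array([[0.0, 1.0, 1.0, 1.0],
                 [1.0, 0.0, 2.0, 2.0],
                 [1.0, 2.0, 0.0, 2.0],
                 [1.0, 2.0, 2.0, 0.0]])

e = euclidean_index(dist)
print(round(e, 6))
```

For this example the result can be checked by hand: the three outer points form an equilateral triangle of side $2^{\alpha}$, and the centre can be placed at distance 1 from all of them only if the circumradius $2^{\alpha}/\sqrt{3}$ does not exceed 1, that is for $\alpha \le \frac{1}{2}\log_2 3$.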

In the case of biased Euclidean distances, the proposal of Joly and Le Calve (1986)
seems of reduced interest, whereas the use of plain analysis leads to more natural
and more easily understandable results than the additive constant. Compared to the
latter, plain analysis biases distances less and seems simpler, more
straightforward, and more robust, outlining the true structure and not the biased
one. In addition, it shows exactly the extra bias necessary to get the Euclidean
approximation. For this reason, once the computed distance is suspected to be
Euclidean, it may be a helpful tool for detecting measurement errors in distances
or, if the distance is computed in some way, which pairs of individuals may be
responsible for the bias. In the spirit of exploratory analyses, this seems a very
interesting opportunity.

As a final remark, since Benasseni (1994) proposes a method aiming at adding a
constant only to some distances, i.e. those suspected of non-Euclideanity, it may
be interesting to compare it with plain analysis. It is likely that the results
obtained through plain analysis may be used as an input for Benasseni's procedure.

Abstract: In this paper a very general way of modelling dissimilarities is
proposed, based on ideas derived from multigraph theory. The proposed method
admits Gower's dissimilarity index as a special case and gives the possibility to
cope with measurement errors and within subject variability when computing
dissimilarities. The approach also makes it possible to assign a different
importance to the same dissimilarity value on different regions of the variable
domain.

Keywords: Dissimilarities, Multigraphs, Cluster Analysis, Ordination Methods. 



1. Introduction 

Many of the classical multivariate statistical methods - such as cluster analysis
and multidimensional scaling - require proximity matrices as input, and their
performance is heavily conditioned by how accurately the computed proximities
describe the real differences among the observed units.

This is perhaps one reason for the large number of similarity or dissimilarity
measures that can be found in the statistical literature for numeric or binary
variables. Lists and discussions of their properties are presented by, amongst
others, Gordon (1981), Gower (1985), Gower & Legendre (1986), Snijders, Dormaar,
van Schuur, Dijkman-Caes & Driessen (1990), Everitt (1993) and Cox & Cox
(1994).

When one has to deal with mixed variables, the range of possible choices narrows
down. The most widely suggested solution, and perhaps the only one, is
Gower's dissimilarity coefficient (Gower 1971), defined as $d_{ij} = 1 - S_{ij}$
(i and j being two generic units), where

$S_{ij} = \sum_k \delta_{ijk} s_{ijk} \Big/ \sum_k \delta_{ijk}$

and where $s_{ijk} = 1 - |x_{ik} - x_{jk}| / \mathrm{Range}_k$ for numeric
variables, and $s_{ijk} = 1$ or $s_{ijk} = 0$ for binary and nominal variables
according to whether the two units show the same or a different value on the k-th
variable; $\delta_{ijk}$ is typically 1 or 0 depending on whether or not the
comparison is valid for the k-th variable.
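
A minimal sketch of Gower's coefficient for two units is given below. The three variables mimic a numeric, a binary and a nominal variable; the values, the range and the use of `None` to mark a comparison that is not valid are hypothetical choices made only for illustration.

```python
def gower_dissimilarity(x_i, x_j, kinds, ranges):
    """d_ij = 1 - S_ij for two units described by mixed variables."""
    numerator, denominator = 0.0, 0.0
    for a, b, kind, value_range in zip(x_i, x_j, kinds, ranges):
        if a is None or b is None:
            continue  # delta_ijk = 0: the comparison is not valid
        if kind == "numeric":
            s = 1.0 - abs(a - b) / value_range
        else:  # binary or nominal variable
            s = 1.0 if a == b else 0.0
        numerator += s
        denominator += 1.0
    if denominator == 0.0:
        raise ValueError("no valid comparison between the two units")
    return 1.0 - numerator / denominator

kinds = ("numeric", "binary", "nominal")
ranges = (100.0, None, None)       # hypothetical range of the numeric variable

unit_a = (100.0, 1, 1)
unit_b = (150.0, 0, 1)
unit_c = (120.0, None, 2)          # second variable not observed

d_ab = gower_dissimilarity(unit_a, unit_b, kinds, ranges)
d_ac = gower_dissimilarity(unit_a, unit_c, kinds, ranges)
print(d_ab, d_ac)
```

For the first pair the three similarities are 0.5, 0 and 1, so that $S_{ij} = 0.5$; for the second pair only two comparisons are valid, with similarities 0.8 and 0, so that $S_{ij} = 0.4$.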

A different way of analysing mixed data has been proposed by Godehardt
(1990) in the cluster analysis context and is based on graph-theoretic concepts.
For each variable (or block of variables taking values on the same measurement
scale and phenomenally linked) a specific similarity or dissimilarity measure is
calculated between every pair of units. If m is the number of variables (or of
blocks), this gives m dissimilarities between any pair and defines a multigraph
with the n objects as vertices. Every variable (or block) is thus considered as a
layer of a related multigraph. Coherently with the goals of cluster analysis and
with the philosophy of linkage methods, a pair of vertices in a layer is joined by
an edge if the dissimilarity between the corresponding objects is less than a
specified threshold d, which is generally different for the different layers and
depends on the dissimilarity measure that has been used to evaluate differences on
the variable (or the block) corresponding to the layer. For blocks containing a
single quantitative variable, Godehardt suggests choosing d as a function of the
standard deviation of the variable, but no further advice is given either as to
which function has to be used or as to what choice is suitable for nominal
variables and for blocks comprising more than one variable. Cluster analysis is
then performed on what Godehardt calls the s-projection of a multigraph, that is a
graph having the same vertices as the multigraph and in which two vertices are
linked if they are joined by at least s edges in the multigraph itself.

This approach to modelling dissimilarities gives interesting results as far as 
cluster analysis is concerned but may not be completely satisfactory when the 
same data have to be analysed by ordination methods too. In fact while cluster 
analysis aims at dissecting the data into homogeneous groups, ordination aims 
at graduating dissimilarities between units (Krzanowski & Marriott 1994) and 
therefore, in this context, the choice of a single threshold for each layer may 
sometimes conceal important features. 



2. A weighting procedure for dissimilarities 

Godehardt's model for dissimilarities can be modified to allow a more detailed
description of the units by weighting the edges of the multigraph, that is by
giving edges a weight that is linked to the corresponding dissimilarity in a linear
or non linear way, according to the amount of information one has about the units
and the variables and wants to convey into the analysis.

To put it into an operational perspective, for the k-th variable (or block of
variables) $t_k$ thresholds ${}_k d_1 < \dots < {}_k d_l < \dots < {}_k d_{t_k}$
(i.e. a different number of thresholds can be used for different variables) and
$t_k$ weights ${}_k v_1 < \dots < {}_k v_l < \dots < {}_k v_{t_k}$ are defined such
that, denoting by $d_{ijk}$ the dissimilarity between the units i and j on variable
k, if $d_{ijk} < {}_k d_1$, then $v_{ijk} = 0$; if $d_{ijk} \ge {}_k d_{t_k}$, then
${}_k v_{t_k} = 1$; else, for ${}_k d_l \le d_{ijk} < {}_k d_{l+1}$, ${}_k v_l$ is
set equal to a user defined value that is generally different for different l's.

In so doing the dissimilarities (however computed) are mapped onto the 0-1
interval, with the effect of giving the same weight to all the dissimilarities
falling between the same thresholds and of allowing non linear weighting of
dissimilarities within the same variable. Different weights can also be given to
different variables by simply changing the width of the interval onto which
dissimilarities are mapped: for instance, a variable whose mapping interval has
been set equal to 0-2 is given a double weight with respect to the remaining ones.

The suggested procedure is very general and can be applied to all the
dissimilarity measures usually examined in the statistical literature (see Cox &
Cox 1994) for quantitative, qualitative and dichotomous variables.

Once the multigraph has been constructed in the previously described way,
proximity analysis can be performed on its projection, in which a pair of vertices
is joined by an edge whose weight equals the sum of the weights of the
corresponding edges in the multigraph, divided by the sum of the weights given to
the different variables (the sum equals m if all the variables are equally
weighted).

Multigraphs represent an elegant model for dissimilarities, but are not the only
way in which the procedure we are describing can be formulated. Alternatively
one can consider m dissimilarity matrices (one for each variable or block of
variables), transform them into as many matrices $V_k$ whose entries are defined as

$v_{ijk} = \begin{cases} 0 & \text{if } d_{ijk} < {}_k d_1 \\ f_k(d_{ijk}) & \text{otherwise} \end{cases}$

where $f_k$ is a monotone increasing function of the observed dissimilarities, and
synthesize them in a single matrix A whose entries are given by

$a_{ij} = \sum_k v_{ijk} w_k \Big/ \sum_k w_k$

where $w_k$ is the k-th variable weight.

The crucial point of the method is the choice of the thresholds and of the 
weighting procedure. One can either devise a procedure for determining the 
thresholds (linearly or non linearly depending on the research aims) and then 
weight differences linearly, or choose equally spaced thresholds and devise a 
weighting criterion, which gives linear or non linear weights depending on the 
character nature and the exploratory goals. From the result point of view the 
two approaches are not dual to each other. For a given value of the maximum 
difference allowed between two units in order for them to be considered as 
identical, the first one produces a small number of intervals of increasing (or 
decreasing) width and all the differences within each interval are equally 
weighted; in so doing it highly flattens “between subjects” variability. If one 
wants to cope with measurement errors while preserving data variability the 
second approach is preferable; in the following we will describe it in detail. 

The first step is the choice of a constant $\varepsilon_k$, for each variable or block of variables, which measures the maximum difference allowed between two units in order for them to be considered as identical ($\varepsilon_k = {}_k d_1$). After that, the number of equally spaced thresholds can be determined, for variables measured on any measurement scale as well as for blocks of variables, as

$$t_k = \mathrm{int}({}_k d_{MAX} / \varepsilon_k)$$

where int(a) denotes the integer part of a, that is the largest integer less than or equal to a (e.g. int(2.1) = 2), and ${}_k d_{MAX} = \max\{d_{ijk}\}$. The weighting function $f_k$ can then be chosen within the monotone increasing convex functions, if one wishes to give differences an importance which is less than proportional to the distance values, within the monotone increasing concave functions, if the importance is to be more than proportional, within straight lines, if the importance of the differences is to be proportional to their values (fig. 1) (see Anderberg (1973) for a slightly different proposal). Once the weighting function has been defined, $v_{ijk}$ is determined as



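$$v_{ijk} = \lambda_k\, f_k(\mathrm{int}(d_{ijk}/\varepsilon_k))$$

where $\lambda_k$ is a normalizing constant which compels $v_{ijk}$ to lie between 0 and 1.*

Figure 1: Weight behavior for three different choices of the weighting function: $v_{ijk} = (\mathrm{int}(d_{ijk}/\varepsilon_k))^2 / t_k^2$ (solid line), $v_{ijk} = \mathrm{int}(d_{ijk}/\varepsilon_k) / t_k$ (dashed line), $v_{ijk} = (\mathrm{int}(d_{ijk}/\varepsilon_k))^{1/2} / t_k^{1/2}$ (dotted line)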
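A minimal sketch of the threshold weighting may help to fix ideas. It assumes, as in the three curves of figure 1, that the normalizing constant is the value of the weighting function at the number of thresholds; the function name `threshold_weight` and the numerical values are hypothetical.

```python
import math

def threshold_weight(d, eps, d_max, f=lambda s: s ** 2):
    """Weight of a dissimilarity d: f(int(d/eps)) divided by f(t), with t = int(d_max/eps).

    Assumes f(0) = 0 and d_max >= eps, so that the result lies between 0 and 1.
    """
    t = math.floor(d_max / eps)      # number of equally spaced thresholds
    steps = math.floor(d / eps)      # thresholds crossed by the dissimilarity d
    return f(steps) / f(t)

eps, d_max = 5.0, 50.0               # hypothetical tolerance and maximum dissimilarity
same_unit = threshold_weight(3.0, eps, d_max)    # below the first threshold
w12 = threshold_weight(12.0, eps, d_max)         # 12 and 14 fall in the same interval
w14 = threshold_weight(14.0, eps, d_max)
linear = threshold_weight(12.0, eps, d_max, f=lambda s: s)
```

Dissimilarities falling between the same two thresholds receive the same weight, which is how the procedure absorbs small measurement errors.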




To fix ideas, the following simple example may help. Consider five houses on which three variables have been observed: the surface area ($X_1$, in m²), the presence of a garage ($X_2$, 1 = 'yes', 0 = 'no'), the location of the house ($X_3$, 1 = 'town centre', 2 = 'suburbs', 3 = 'country'). The data matrix X is given in fig. 2a. 



* If the chosen function $f_k$ does not pass through the origin, it has to be translated to $g_k$ so that $g_k(0) = 0$.

For the three blocks (one for each variable) $\varepsilon_1 = 5$, $\varepsilon_2 = {}_2 d_{MAX}$, $\varepsilon_3 = {}_3 d_{MAX}$ have been chosen, thus giving $t_1 = \mathrm{int}({}_1 d_{MAX}/5)$ and $t_2 = t_3 = 1$ (in these last two cases there is no need to choose any weighting function). The weighting function for $X_1$ has been defined as $f_1(\mathrm{int}(d_{ij1}/\varepsilon_1)) = (\mathrm{int}(d_{ij1}/\varepsilon_1))^2$ where $d_{ij1} = |x_{i1} - x_{j1}|$, thus overweighting large differences, so $v_{ij1} = (\mathrm{int}(d_{ij1}/\varepsilon_1))^2 / t_1^2$. The multigraph in fig. 2b illustrates the weights $v_{ijk}$ for each variable and each couple of units, and the weighted graph in fig. 2c the corresponding dissimilarities $a_{ij}$ obtained by using $w_1 = 2$, $w_2 = w_3 = 1$ as weights for the three variables. No edge has been drawn if $v_{ijk} = 0$ or $a_{ij} = 0$. 



Figure 2: A simple illustrative example of the proposed procedure 

3. A further weighting procedure for single variables 

The weighting procedure illustrated in the previous paragraph can be modified, for blocks composed of a single variable, to assign a different weight to the same dissimilarity value in different regions of the variable domain. This possibility (not contemplated in the usual dissimilarity measures) could be very useful for some variables. For example, the importance of a given difference (e.g. £ 100000) between the monthly incomes $x_i$ and $x_j$ of two individuals i and j is lower when $x_i$ and $x_j$ are both high than when they are both low: a different weight should be assigned to the difference on the basis of the position of $x_i$ and $x_j$ in the income domain. This result can be obtained by means of the following weighting procedure. 

The first step is the construction of a partition $\{I_1, \ldots, I_h, \ldots, I_g\}$ of the domain $[{}_k x_{min}, {}_k x_{max}]$ of the variable $X_k$, where $I_h = [{}_k x_{h-1}, {}_k x_h)$ is the generic class of the partition, with ${}_k Range_h = {}_k x_h - {}_k x_{h-1}$, and ${}_k x_0 = {}_k x_{min}$, ${}_k x_g = {}_k x_{max}$. After that, a constant ${}_k\varepsilon_h$ for each class h has to be chosen, thus giving the corresponding number of equally spaced thresholds ${}_k t_h = \mathrm{int}({}_k Range_h / {}_k\varepsilon_h)$, and a possibly different function ${}_k f_h$ for each class $I_h$, whose normalizing constant is ${}_k\lambda_h$, has to be defined. The computation of the weight then depends on the classes which the units i and j belong to: 

$$v_{ijk} = \ldots$$

where $r > l$, ${}_k\pi_h = {}_k Range_h / Range_k$, and it has been supposed, without loss of generality, that $x_{ik} \ge x_{jk}$. 

In this procedure, different constraints on the ${}_k\varepsilon_h$'s and the ${}_k t_h$'s have to be imposed for h = 1, ..., g according to whether one wishes to overweight or underweight large differences: in the first case ${}_k t_h \le {}_k t_{h+1}$ and ${}_k\varepsilon_h \le {}_k\varepsilon_{h+1}$ (thus ${}_k Range_h \le {}_k Range_{h+1}$), in the second one ${}_k t_h \ge {}_k t_{h+1}$ and ${}_k\varepsilon_h \ge {}_k\varepsilon_{h+1}$ (thus ${}_k Range_h \ge {}_k Range_{h+1}$). 
The following simple example helps to clarify this further weighting procedure and illustrates its effectiveness. Consider six families belonging to the group of two-person families for which the monthly income has been observed: the six values, together with the group minimum and maximum, are illustrated in fig. 3. 

Figure 3: Six monthly incomes ($x_{ik}$, in thousand £) classified into three classes: $I_1 = [500, 2000)$, $I_2 = [2000, 5000)$, $I_3 = [5000, 10000]$ 

As can be seen, two families are relatively poor, two relatively rich, and two intermediate, and the three couples of families so formed have exactly the same income difference. In order to give income differences an importance which is less than proportional to their values, three weighting functions ${}_k f_h$ (one for each income situation) have been chosen within the monotone increasing convex functions, and the three income classes $I_h$ have been defined so as to satisfy the constraints ${}_k Range_h \le {}_k Range_{h+1}$. All the parameters chosen in this simple example are listed in table 1. 

Table 1: Parameters used to compute the dissimilarities for the example of figure 3 

Table 2 shows the dissimilarities $v_{ijk}$ computed by means of the proposed procedure for each couple of the six families. By inspection of the values it 
clearly emerges that the three previously described couples of families are not 
considered as equally dissimilar; the same thing occurs when the comparison is 
for instance between vjsk and V 2 .#*. 



Table 2: Dissimilarity (upper triangular) matrix $V_k$ for the example of figure 3 

[Only part of the table body survives: row O4: 0.0614, 0.0695; row O5: 3·10^…; row O6: 0] 



4. Conclusions 

The suggested approach is very general as it admits Godehardt's model for dissimilarities and Gower's coefficient as special cases. In fact, if a single threshold is chosen for each variable, one obtains Godehardt's model. Gower's solution is obtained if: on each layer dissimilarity is evaluated following Gower's suggestions (as many blocks as variables are considered); the number of different thresholds within each variable is set equal to the number of different dissimilarity values ($t_k = n(n-1)/2$ for k = 1, ..., m); the weights are made to correspond to the dissimilarities themselves, sorted in ascending order (${}_k d_l$ is set equal to the l-th ordered computed dissimilarity value $d_{ijk}$, for l = 1, ..., $t_k$, and the weighting function is simply the reciprocal of the variable range); and all the variable weights are set equal to one. 

The appeal of the suggested solution is its versatility. It allows mixed variables to be analysed using for each of them the proximity measure one deems best, thus freeing the researcher from sticking to Gower's suggested indices, and it does not require variable transformation in order to synthesize dissimilarities expressed in different units. Furthermore it can cope with measurement errors and treat characters with a high within-subject variability by equally weighting slightly different dissimilarities; a possibility which is not contemplated either in Gower's index or in the multivariate methods (such as INDSCAL) which can be used, from a different perspective, to synthesize the m dissimilarity matrices defined with respect to the m variables. 

A further interesting aspect of the threshold based procedure is that it can also be successfully applied to single variables for which distances of the same size located at different positions of the variable domain assume a different importance. 

Moreover a suitable choice of weights can prevent two units which differ in all layers, and two units which significantly differ in only one of their variables and are similar in the remaining m-1, from being considered equally dissimilar. 

Abstract: A wider research project aimed to obtain timely evaluations of the professional market in the province of Naples through in-depth interviews with experts. This paper shows some results on one set of professions, according to DS and MDS procedures. 



1. Introduction 

The purpose is to obtain ranking forecasts (for the year 2005) for given sets of professions, through interviews with a group of experts. These subjective evaluations were preferred to accurate quantifications, which employ traditional methods, require analytical and sophisticated procedures, are hard to carry out and are not always reliable. 

The reference geographical area is the province of Naples. 



2. Data Collected from Experts 

This research was based on interviews with qualified persons in most of the 
professional fields examined, called privileged interlocutors. These are experts 
who have either studied or gained wide professional experience in the 
professions under study. 

The size of the group cannot be too large because of the quali-quantitative nature of the interview; empirical results show that 15-20 participants are enough in all situations. In this research the number of experts interviewed was twenty. 



3. The Construction of Item Evaluations 

The questionnaire was constructed to get forecasts on professional classifications, in the near future (year 2005), in the province of Naples. For each homogeneous list, relative scores for each possible pair of professions were sought: for each pair of professions we asked experts to indicate, according to the Saaty scale, how many times one profession is greater than or less than the other, regarding particular aspects (difficulties to overcome and future development perspectives). The Saaty scale is composed of all the whole numbers in the range 1-9 and of their reciprocals, which are used when the reference object of the evaluation is inverted. This scale derives from empirical studies which have demonstrated that a person can compare simultaneously at most from five to nine objects. 



4. Binary Evaluation Matrices as Input for Multivariate 
Methods 

The need to have evaluations for each pair of professions led to the compilation 
of a matrix for each group of professions being directly compared, having on 
the rows and in the columns the same professions to compare. This matrix can 
be defined as a binary evaluation matrix and it is very useful in surveys. In fact, 
we may sometimes prefer to obtain binary comparisons between each 
profession and the others, rather than asking respondents to grade a number of n 
professions according to their importance regarding the evaluation criterion. 
The generic element $a_{ij}$ was the value attributed to the profession of the i-th row when it was compared with the j-th column profession, that is how many times the i-th profession is more or less than the j-th one, regarding particular aspects (difficulties to overcome and future development perspectives). Of course, if the first profession is more than the second one, $a_{ij}$ will be a number from 1 to 9 (Saaty scale), otherwise it will be the reciprocal of a number from 1 to 9, and will therefore lie in the range 0 to 1. In other words, the matrix so defined is an n-rowed square matrix, called reciprocal, as it satisfies the following conditions: 
$a_{ij} = 1/a_{ji}$, each element is equal to the reciprocal of its symmetric element; 
$a_{ii} = 1$, the diagonal elements, resulting from the comparison of each element with itself, are equal to one. 
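The two conditions can be illustrated with a short sketch that fills a reciprocal matrix from the judgements given above the diagonal. The helper name `reciprocal_matrix` and the three judgements are hypothetical; they are not data from the survey.

```python
import numpy as np

def reciprocal_matrix(upper):
    """Build an n x n reciprocal matrix from judgements {(i, j): a_ij} with i < j."""
    n = max(j for _, j in upper) + 1
    A = np.ones((n, n))                  # a_ii = 1 on the diagonal
    for (i, j), value in upper.items():
        A[i, j] = value                  # judgement of profession i against j
        A[j, i] = 1.0 / value            # symmetric element is the reciprocal
    return A

# Hypothetical judgements on three professions (Saaty values or their reciprocals)
judgements = {(0, 1): 3.0, (0, 2): 5.0, (1, 2): 1.0 / 2.0}
A = reciprocal_matrix(judgements)
```

Only n(n-1)/2 judgements have to be collected from each expert; the rest of the matrix follows from the reciprocity condition.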
5. Multivariate Methods 

5.1. The Dominant Eigenvalue Scores (DES) Method 

According to the Dominant Eigenvalue Theory, a reciprocal matrix whose judgements are perfectly consistent (that is, $a_{ij} = w_i/w_j$ for some positive scores $w_i$) has all its eigenvalues null, except one, equal to n, which is therefore called dominant. The elements of its related eigenvector, also called dominant, are the scores to attribute to each profession because they express the importance, in normalised terms, implicitly given to each profession in the pairwise attribution of scores. This means that the scores forming the list are independent of any unit of measurement. So, the importance of DES derives from the possibility of obtaining a professional classification indirectly, that is, without asking interviewees for it directly, but extracting it from the evaluations of pairs of professions, which are codified judgements and are easier to obtain by means of the Saaty scale. 

Thus, dominant eigenvector elements can be used in computations to give each 
profession a concise score. 

The theoretical model on which the DES is based can be shown through a simple example. Considering a number of n professions, whose normalised scores are given by the vector $w = (w_1, w_2, \ldots, w_n)$, with $\sum_i w_i = 1$, and supposing that only the score ratios $a_{ij} = w_i/w_j$ of the professions are known, we can construct a score ratio matrix: 

$$A = \begin{pmatrix} w_1/w_1 & w_1/w_2 & \cdots & w_1/w_n \\ w_2/w_1 & w_2/w_2 & \cdots & w_2/w_n \\ \vdots & \vdots & & \vdots \\ w_n/w_1 & w_n/w_2 & \cdots & w_n/w_n \end{pmatrix}$$

A is a reciprocal matrix, because $a_{ij} = 1/a_{ji}$. The positivity of its elements can be easily demonstrated. Then, the normalised scores of the professions compared can be obtained by solving the following equation system in the unknown w: 

$$Aw = \lambda w \qquad (1)$$

System (1) can also be written as follows, according to eigenvalue theory: 

$$(A - \lambda I)\,w = 0. \qquad (2)$$

It is easily recognisable as an eigenvalue problem, which has a non-trivial solution for $\lambda = n$ if and only if n is an eigenvalue of A; w will be, in this case, the corresponding eigenvector. 

Three results of positive matrix theory can be set out: 

1. the theorem of Perron-Frobenius states that only one dominant eigenvalue 
exists for positive matrices; 

2. as matrix A has a rank equal to one (because each row is a constant multiple 
of the first one), all its eigenvalues are null except one; 

3. the eigenvalue sum of a positive matrix is equal to its trace (that is the sum of its diagonal elements). 

With reference to the third point, $a_{ii} = w_i/w_i = 1$ and, then: 

$$\sum_{i=1}^{n} \lambda_i = \mathrm{tr}(A) = n.$$

In light of what has been said on positive matrix properties, only one of the eigenvalues $\lambda_i$, denoted $\lambda_{max}$, will be equal to n, with $\lambda_i = 0$ for $\lambda_i \ne \lambda_{max}$. The related eigenvector w will be derived from the normalisation of any column of A. Thus, the solution of the equation system (1) or of the corresponding eigenvalue problem (2) enables us to calculate the vector of normalised scores from the matrix of binary evaluations. Therefore, according to these principles, from binary evaluation matrices we can derive important normalised scores to ascribe directly to the professions being compared. 
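These properties can be checked numerically on a small consistent matrix. The sketch below is only an illustration: the score vector is hypothetical, and with real expert judgements, which are rarely perfectly consistent, the dominant eigenvalue would in general differ from n.

```python
import numpy as np

# Hypothetical normalised scores for four professions (they sum to one)
w = np.array([0.4, 0.3, 0.2, 0.1])
A = np.outer(w, 1.0 / w)                       # score ratio matrix, a_ij = w_i / w_j

eigenvalues, eigenvectors = np.linalg.eig(A)
k = np.argmax(eigenvalues.real)                # position of the dominant eigenvalue
lambda_max = eigenvalues[k].real
v = eigenvectors[:, k].real
scores_from_eig = v / v.sum()                  # rescale so that the scores sum to one

scores_from_column = A[:, 0] / A[:, 0].sum()   # normalising any column gives the same scores
```

Both routes return the original scores, and the dominant eigenvalue equals the trace of A, that is n = 4.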

5.2. Dimensional and Multidimensional Scaling (DS And MDS) 
Methods 

Besides deriving scores, binary evaluation matrices are also the basis for 
projecting professions in a two-dimensional context, highlighting associations 
and contrapositions. 

The techniques applied were DS, which uses as its input the entire binary evaluation matrix of a single expert, and MDS, which uses as its input all of the experts' binary evaluation matrices, in order to synthesise them. The latter technique, in particular, enables us to have a comprehensive description of individual analyses, highlighting, at the same time, relations between professions which would be unlikely to emerge from individual evaluations. The choice of a multidimensional statistical analysis technique, which is very sophisticated, was influenced by the nature and the availability of data, as well as by our objectives, which were: 

a) to highlight the similarities and differences among professions according to 
the importance attributed to them by the group of experts; 

b) to point out the (unknown) factors which most influence the evaluations given by each expert, because the available data are proximity judgements about couples of professions expressed in conformity with a quantitative type of scale. 

The DS technique uses factorial methods; more precisely, it belongs to the individual difference techniques. Given the quantitative nature of the data, due to the application of a special evaluation scale, a method based on rank order, and not a metric method, was chosen. 

It consists of a representation of the positions of the professions in a space of smaller dimension (in general 2), showing the unknown factors on which evaluations were based. Besides the synthesis of individual evaluations, carried out using the multidimensional-scaling technique, a separate representation of experts was made in the same space, showing the contribution given by each of them. 

This type of analysis enables us to pick out the particular features and the common characteristics of the various professions, taking into account that reducing an n-dimensional space to a simplified one with only two dimensions necessarily causes the loss of a part of the variability of the phenomenon analysed, and that two elements appearing near to each other in the two-dimensional space may not be equally near in the original space. 



6. Some Comments 

Tables 1 and 2, the first referring to difficulties to overcome, the second to future development perspectives, highlight, on the one hand, a comparison between DES analysis and similarity-dissimilarity analyses and, on the other hand, a comparison between the results of expert n. 5 and those deriving from the whole group of experts. 

Before reading these tables, we need to make some observations. First, the DS and MDS techniques lose a certain degree of the original variability, because a smaller number of dimensions than the original one is chosen. Thus, from this point of view, the score analysis, which is not affected by this problem, is preferable. 

Another distinction can also be made regarding similarity and dissimilarity analyses. In fact, the results obtained with the MDS technique are more reliable than the ones referring to the single expert, because MDS includes an iterative procedure whose stopping criterion is connected to the stress value, which must not exceed a fixed threshold value. Furthermore, this technique gives the rate of original variability captured by the two factorial axes. In this research reference is made mainly to the first factorial axis, because of its greater explanatory power. 

Table 1: Synthesis of the most significant results for the Culture and Training professional set referring to expert n. 5 and to the whole group of experts with regard to the evaluations made according to the criterion, difficulties to overcome. 

[Table body not recoverable; surviving column headings: Profession, DS analysis, MDS analysis] 

(*) The attribution of scores must be read in this way: 1 the most important profession, according to the criterion being used, 2 the second one in importance, ... up to 9, the least important. 
(**) The attribution of numbers to the four quadrants was carried out in an anticlockwise way with reference to both factorial axes, considering as first quadrant the one whose co-ordinates are both positive. 
(***) Contrapositions are considered only with reference to the first factorial axis. The lists follow a decreasing order in the absolute value of the co-ordinate on the first factorial axis, so that the first in the list are different, the last are similar. 

Professional list: a Librarian; b Teacher-Training Co-ordinator; c Documentalist; d Elementary School Teacher; e Support Teacher; f Archives Operator; g Career Advisor; h Training Planner; i Restorer. 

Table 2: Synthesis of the most significant results for the Culture and Training 
professional set referring to expert n. 5 and to the whole group of experts with 
reference to the evaluations made according to future development 
perspectives. 




For (*) - (**) - (***) and Professional list see Table 1. 

As we can see from Tables 1 and 2, the evaluations of expert n. 5 are particularly close to those of the whole group of experts. For example, with reference to the DES analysis, both expert n. 5 and the group of experts attribute the greatest importance to Teacher-Training Co-ordinator, Support Teacher and Training Planner, with regard to difficulties to overcome, and to Career Advisor, Training Planner and Restorer, for future development perspectives. 

Similar results also come from the DS and MDS techniques. In fact, the contrapositions arising from professional projections in two-dimensional space are the same with reference to expert n. 5 and to the evaluations of the whole group of experts: the professions connected with the most demanding jobs are Training Planner, Career Advisor, Teacher-Training Co-ordinator, Support Teacher and Elementary School Teacher, in antithesis to those with more operative functions, which are Archives Operator, Restorer, Documentalist and Librarian. This also confirms what emerged from the Derived Subject Weights about the primary role of expert n. 5 in the construction of the factorial axes. The results referring to future development perspectives are quite similar. 

In Table 3 the professions in contraposition in the DS and MDS analyses, referring only to the first factorial axis, are listed in each column. We can verify that the professional clusters in contraposition are exactly the same for expert n. 5 and the whole group of experts with reference to difficulties to overcome, while, for future development perspectives, the only professions that do not respect this principle are Teacher-Training Co-ordinator, on the one hand, and Elementary School Teacher, on the other. 

Other important considerations could also be made on the differences between couples of normalised scores obtained, on the one hand, from DES analysis, that is from the elements of the dominant eigenvector, and, on the other hand, from DS and MDS analyses, using as scores the co-ordinates on the first factorial axis, although these refer only to a part of the total variability of the phenomenon. These differences highlight the gap existing between each couple of professions according to the experts' evaluations. 

There is a positive correlation between the evaluations of the single expert and those of the whole group of experts, with regard both to difficulties to overcome and to future development perspectives. This confirms the widely representative role of expert n. 5 in the whole group of experts. 

Table 3: Clusters of professions according to their positioning in DS and MDS applications, for (in order) expert n. 5 and the whole group of experts, with respect to the first factorial axis and with reference to both difficulties to overcome and future development perspectives (in each entry the two clusters in contraposition are separated by a slash). 

Difficulties to overcome 
Expert n. 5 - DS: Archives Operator, Restorer, Documentalist, Librarian / Training Planner, Orientation Expert, Teacher-Training Co-ordinator, Support Teacher, Elementary School Teacher 
Expert group - MDS: Support Teacher, Orientation Expert, Training Planner, Elementary School Teacher, Teacher-Training Co-ordinator / Archives Operator, Restorer, Documentalist, Librarian 

Future development perspectives 
Expert n. 5 - DS: Orientation Expert, Training Planner, Elementary School Teacher, Support Teacher / Restorer, Archives Operator, Librarian, Documentalist, Teacher-Training Co-ordinator 
Expert group - MDS: Orientation Expert, Training Planner, Teacher-Training Co-ordinator, Support Teacher / Restorer, Archives Operator, Librarian, Documentalist, Elementary School Teacher 

The procedure employed enables us to draw some conclusions: the elaborations made through DES analysis, on the one hand, and DS and MDS analyses, on the other hand, converge to produce the same results, but highlight different aspects: the first gives a more complete picture, in one-dimensional space; DS and MDS analyses provide information only on a part of the total variability of the phenomenon, but highlight associations and dissimilarities, which the first one does not. 

Abstract: This paper focuses on some solutions for non-metric full-Multidimensional Scaling (MDS), minimizing the STRESS and S-STRESS loss functions. In particular, the linear transformations of dissimilarities into Euclidean distances minimizing the two loss functions are given. A non-trivial result for S-STRESS with a quadratic transformation of dissimilarities, constraining its coefficients, is also obtained. 

1. Introduction 

Multidimensional scaling (MDS) refers to a class of techniques for approximating a (n×n) matrix $D = [d_{ij}]$ of observed dissimilarities between n objects with a (n×n) Minkowski distance matrix $E_m = [e_{ij}]$ having an associated configuration of n points $[x_{is}] = X \in \mathbb{R}^{n \times p}$ of a p (generally low) dimensional Minkowski space. Interest in defining "the best" approximation in a Euclidean space dates back at least to the seminal work of Torgerson (1958). While approximating D, an MDS technique should preserve in $E_m$ the pattern (manifold) observed in D, thus requiring it to: i) minimize a loss function between D and $E_m$; ii) preserve a monotone relationship between D and $E_m$. A metric MDS procedure satisfies i), while a non-metric MDS satisfies i) subject to ii). The two most popular loss functions in MDS are Kruskal's raw STRESS and the S-STRESS, 

$$\mathrm{STRESS}(X) = \tfrac{1}{2}\,\|D - E(X)\|^2 \qquad (1a)$$

$$\mathrm{S\text{-}STRESS}(X) = \tfrac{1}{2}\,\|D*D - E(X)*E(X)\|^2, \qquad (1b)$$

where matrix E(X) is the Euclidean matrix written as a function of the configuration of points, and A*B is the Hadamard (direct) product of two matrices with equal dimensions. Functions (1a) and (1b) are often normalized and weighted. 
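The two loss functions can be sketched as follows. The code assumes that the norm is the Frobenius norm and that both losses carry a factor 1/2, so that each pair of objects is counted once; the function names and the small configuration are hypothetical.

```python
import numpy as np

def euclidean_matrix(X):
    """Matrix E(X) of Euclidean distances between the rows of the configuration X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(D, X):
    """Half the squared Frobenius norm of D - E(X)."""
    E = euclidean_matrix(X)
    return 0.5 * np.sum((D - E) ** 2)

def s_stress(D, X):
    """Half the squared Frobenius norm of D*D - E(X)*E(X), with * the Hadamard product."""
    E = euclidean_matrix(X)
    return 0.5 * np.sum((D * D - E * E) ** 2)

# Hypothetical configuration: a 3-4-5 right triangle in the plane
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
D = euclidean_matrix(X)
D[1, 2] = D[2, 1] = 6.0          # one observed dissimilarity differs from its distance
stress_value = raw_stress(D, X)
s_stress_value = s_stress(D, X)
```

The same discrepancy weighs much more in S-STRESS than in STRESS, since it is measured on squared values.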

Therefore, a metric MDS technique involves minimizing (1a) or (1b) over the set $\mathbb{R}^{n \times p}$, while a non-metric MDS method consists in minimizing (1a) or (1b) between E(X) and f(D), with f an arbitrary monotone transformation, over the set $\mathbb{R}^{n \times p}$, as well as over the monotone transformation f. In practice, this is classically done numerically by interleaving steepest descent steps on the configuration with estimation of f via isotonic regression of the current distances E(X) on the dissimilarities D. 

Full-multidimensional scaling refers to the case where the dimensionality of the configuration is at most p = n-1, and therefore the number of dimensions is not constrained to be small. 

Notice that the non-metric MDS problem is much more complex than the metric case. In this paper we study some analytical solutions for non-metric full-MDS, for the Euclidean case (m = 2), when the monotone relation ii) is linear or quadratic. A numerical example is given to show the proposed solutions. 

An outline of the material in this paper is as follows. Section 2 recalls non- 
metric MDS, and gives three solutions of non-metric MDS. Section 3 applies 
the proposed transformations. Section 4 reports a discussion on the use of these 
transformations and gives some conclusions. 



2. Non-metric Full-Multidimensional Scaling 

Let $Z = I - (1/n)\mathbf{1}\mathbf{1}'$ be the (n×n) idempotent matrix denoting the orthogonal projection onto the orthogonal complement of $\mathbf{1}$ in $\mathbb{R}^n$, where $\mathbf{1}$ is an n-vector of unitary elements. Let $\mathbb{D}_n$ be the set of (n×n) dissimilarity matrices $D = [d_{ij} : d_{ij} = d_{ji} \ge 0,\ d_{ii} = 0\ \forall\ 1 \le i, j \le n]$. Let $\mathbb{E}_n$ and $\mathbb{E}_n^2$ be the sets of (n×n) Euclidean matrices $E \equiv [e_{ij}]$ and the corresponding (n×n) squared Euclidean matrices $E*E = [e_{ij}^2]$. Let $\tau(D) = -\tfrac{1}{2} Z D Z$ denote the linear and one-to-one Young-Householder transform. The set $\tau(\mathbb{E}_n^2)$ is the closed convex cone of the positive semi-definite matrices of order n, with interior the positive definite matrices. Let each matrix E be identified by the n(n-1)/2 component vector vec(E) obtained by vectorizing the elements of the triangle below (above) the diagonal of E, row-by-row (i.e., $e_{12}, \ldots, e_{1n}, e_{23}, \ldots$). For n = 3, the set $\mathrm{vec}(\mathbb{E}_3^2)$ has the conical form shown in figure 1. 

The boundary of $\mathrm{vec}(\mathbb{E}_3^2)$ is given by matrices with $\det(\tau(D*D)) = 0$, i.e., 

$$2d_{12}^2 d_{13}^2 + 2d_{12}^2 d_{23}^2 + 2d_{13}^2 d_{23}^2 - d_{12}^4 - d_{13}^4 - d_{23}^4 = 0.$$

For S-STRESS the global optimum of metric full-MDS is given by the solution of an eigen-problem, see for example Mathar (1985), while no analytical solution is known for STRESS. However, a local optimum found by an optimization method turns out to be the global optimum since $\mathbb{E}_n$ is a convex cone and STRESS is convex, so that there is only a single minimum. Geometrically, metric full-MDS consists in finding the point vec(E) (or vec(E*E)) of contact between the convex cone $\mathrm{vec}(\mathbb{E}_n)$ (or $\mathrm{vec}(\mathbb{E}_n^2)$) and the hyper-sphere with centre in the point vec(D) (or vec(D*D)); while for the non-metric case the solution is also bounded by n(n-1)/2 - 1 hyper-planes. In figure 1 the solution of non-metric full-MDS using S-STRESS, for n = 3 and vec(D*D) = (3, 3, 20)', is shown. Since we have $d_{12} \le d_{13} < d_{23}$, it follows that $e_{12} \le e_{13} < e_{23}$. The sphere with centre (3, 3, 20)' and minimum radius $(\mathrm{S\text{-}STRESS})^{1/2} = 2\sqrt{2}$ touches the cone in the point vec(E*E) = (5, 5, 20)'. The two planes $d_{12} - d_{13} = 0$, $d_{13} - d_{23} = 0$ bound the solution, as shown in figure 1. 

After these preliminary remarks on MDS, we are now in a position to state some results for non-metric full-MDS. The first identifies the (positive) coefficients of the linear transformation of the dissimilarities D which give a Euclidean matrix E minimizing $\mathrm{tr}(D-E)^2$. 
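The numbers of the n = 3 example can be checked directly against the boundary equation of the cone of squared Euclidean matrices. The sketch below only evaluates the quantities quoted in the text; the function name `boundary_form` is hypothetical.

```python
import math

def boundary_form(s12, s13, s23):
    """Left-hand side of the n = 3 boundary equation, for squared dissimilarities."""
    return (2 * s12 * s13 + 2 * s12 * s23 + 2 * s13 * s23
            - s12 ** 2 - s13 ** 2 - s23 ** 2)

observed = (3.0, 3.0, 20.0)      # vec(D*D) of the example
fitted = (5.0, 5.0, 20.0)        # vec(E*E) of the example

outside = boundary_form(*observed)       # negative: D*D is not squared Euclidean
on_boundary = boundary_form(*fitted)     # zero: the fitted point lies on the boundary
radius = math.dist(observed, fitted)     # distance between the two points
```

The observed point lies outside the cone, the fitted point lies exactly on its boundary, and their distance is the radius 2√2 reported above.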



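Theorem 1: Let $D \in \mathbb{D}_n$. The linear transformation of D: 

$$E = a(\mathbf{1}\mathbf{1}' - I) + bD, \qquad (2)$$

is Euclidean and such that: $\min\{\mathrm{tr}(D-E)^2 \mid D \in \mathbb{D}_n,\ -\tfrac{1}{2} Z(E*E)Z \ge 0\}$, i.e., it minimizes the raw STRESS, when: 

$$b = b^* = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 + c\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}}{n(n-1)c^2 + \sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}^2 + 2c\sum_{i=1}^{n}\sum_{j=1}^{n} d_{ij}}; \qquad a = a^* = c\,b^*, \qquad (3)$$

where c is the maximum real eigenvalue of 

$$\begin{pmatrix} 0 & 2\tau(D*D) \\ -I & -4\tau(D) \end{pmatrix}.$$

Note that c is the minimum additive constant of Cailliez (1983). 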



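Proof. Matrix $a(\mathbf{1}\mathbf{1}'-I)+D$ is Euclidean iff $a \ge c$, the minimum additive constant (Cailliez, 1983). Thus, matrix $E = b[c(\mathbf{1}\mathbf{1}'-I)+D]$ is Euclidean for every $b > 0$, since $W = -\tfrac{1}{2} Z(E*E)Z \ge 0$ is positive semi-definite. 

Furthermore, $F = \mathrm{tr}(D-E)^2 = \mathrm{tr}D^2 + b^2\,\mathrm{tr}[c(\mathbf{1}\mathbf{1}'-I)+D]^2 - 2b\,\mathrm{tr}\{D[c(\mathbf{1}\mathbf{1}'-I)+D]\}$ and the Normal equation is: $dF/db = 2\left(b\,\mathrm{tr}[c(\mathbf{1}\mathbf{1}'-I)+D]^2 - \mathrm{tr}\{D[c(\mathbf{1}\mathbf{1}'-I)+D]\}\right) = 0$, so that (3) follows. ■ 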
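The construction in the proof can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: the function names and the 3×3 dissimilarity matrix are hypothetical, and the additive constant is obtained as the largest real eigenvalue of the block matrix given in the theorem.

```python
import numpy as np

def tau(M):
    """Young-Householder transform -1/2 Z M Z, with Z the centring projector."""
    n = M.shape[0]
    Z = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * Z @ M @ Z

def additive_constant(D):
    """Largest real eigenvalue of [[0, 2 tau(D*D)], [-I, -4 tau(D)]]."""
    n = D.shape[0]
    block = np.block([[np.zeros((n, n)), 2.0 * tau(D * D)],
                      [-np.eye(n), -4.0 * tau(D)]])
    eig = np.linalg.eigvals(block)
    return eig[np.abs(eig.imag) < 1e-6].real.max()

def linear_euclidean_fit(D):
    """Coefficients a*, b* and matrix E = a*(11' - I) + b* D, from the Normal equation."""
    n = D.shape[0]
    off_diagonal = np.ones((n, n)) - np.eye(n)
    c = additive_constant(D)
    M = c * off_diagonal + D
    b = np.trace(D @ M) / np.trace(M @ M)
    a = c * b
    return a, b, a * off_diagonal + b * D

# Hypothetical dissimilarities violating the triangle inequality (1 + 1 < 3)
D = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
a, b, E = linear_euclidean_fit(D)
```

For this matrix the additive constant is 1, so that a* = b* = 2/3 and the fitted distances 4/3, 4/3, 8/3 describe three collinear points, that is a matrix on the boundary of the Euclidean cone.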

The configuration of points is found in at most p=n-2 dimensions since 1 is an 
eigenvector with an induced zero root and an extra zero eigenvalue has to be 
found to have E lying on the boundary of E„. 

The coordinates of E=a*(ll'-I)+^’D are: X=[ui,..,u„. 2 ]diag(A(i),..,A(„. 2 ))'^^, where 
/1 ^i),...,A(„. 2) are the positive eigenvalues of t{a\lV-\)+2ab*'D+b*^D*Yy) in 
decreasing order, and ui,...,u „.2 are the corresponding eigenvectors. A similar 
result can be stated for S-STRESS. 
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Figure 1: Geometrical representation of the non-metric fuU-MDS 
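Theorem 2: Let $D \in \mathcal{D}_n$, and let $\alpha$ be the minimum eigenvalue of $\tau(D*D)$. The
linear transformation of (D*D):

$$E*E = d(\mathbf{1}\mathbf{1}'-I) + f\, D*D \qquad (4)$$

is Euclidean and such that $\min\{\mathrm{tr}(D*D-E*E)^2 \mid D \in \mathcal{D}_n,\ -\tfrac{1}{2} Z(E*E)Z \ge 0\}$, i.e.,
it minimizes S-STRESS, when:

$$f = f^* = \frac{\mathrm{tr}\{D*D[-2\alpha(\mathbf{1}\mathbf{1}'-I) + D*D]\}}{\mathrm{tr}[-2\alpha(\mathbf{1}\mathbf{1}'-I) + D*D]^2}; \qquad d = d^* = -2\alpha f^*. \qquad (5)$$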



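Proof. Matrix $d(\mathbf{1}\mathbf{1}'-I)+D*D$ is Euclidean iff $d \ge -2\alpha$ (Lingoes, 1971). For every
$f > 0$ the matrix $E*E = f[-2\alpha(\mathbf{1}\mathbf{1}'-I) + D*D]$ is such that E is Euclidean.
Furthermore, $G = \mathrm{tr}(D*D-E*E)^2 = \mathrm{tr}(D*D)^2 + f^2\, \mathrm{tr}[-2\alpha(\mathbf{1}\mathbf{1}'-I) + D*D]^2
- 2f\, \mathrm{tr}\{D*D[-2\alpha(\mathbf{1}\mathbf{1}'-I)+D*D]\}$, and the normal equation is
$(dG/df) = f\, \mathrm{tr}[-2\alpha(\mathbf{1}\mathbf{1}'-I)+D*D]^2 - \mathrm{tr}\{D*D[-2\alpha(\mathbf{1}\mathbf{1}'-I)+D*D]\} = 0$, so (5) follows. ■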



Also in this case the configuration of points is found in at most p=n-2
dimensions with coordinates: $X=[u_1,\dots,u_{n-2}]\,\mathrm{diag}(\lambda_{(1)},\dots,\lambda_{(n-2)})^{1/2}$, where
$\lambda_{(1)},\dots,\lambda_{(n-2)}$ are the ordered eigenvalues of $\tau(d^*(\mathbf{1}\mathbf{1}'-I)+f^*D*D)$, and $u_1,\dots,u_{n-2}$ the
corresponding eigenvectors.
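A minimal numerical sketch of Theorem 2 and of the coordinate recovery described above is given below. The function names and the 4 x 4 matrix are hypothetical, and $\tau(A) = -\tfrac{1}{2}ZAZ$ is assumed; the code only illustrates equations (4) and (5) and is not the authors' implementation.

```python
import numpy as np

def tau(A):
    """Double centring, assumed here to be tau(A) = -1/2 * Z A Z with Z = I - 11'/n."""
    n = A.shape[0]
    Z = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * Z @ A @ Z

def theorem2_transform(D):
    """Return (d*, f*, E*E) with E*E = d*(11' - I) + f* D*D, following equation (5)."""
    n = D.shape[0]
    D2 = D * D                                   # Hadamard square D*D
    alpha = np.linalg.eigvalsh(tau(D2)).min()    # minimum eigenvalue of tau(D*D)
    J = np.ones((n, n)) - np.eye(n)
    B = -2.0 * alpha * J + D2
    f = np.trace(D2 @ B) / np.trace(B @ B)
    d = -2.0 * alpha * f
    return d, f, d * J + f * D2

# Hypothetical dissimilarities that violate the triangle inequality (1 + 1 < 4).
D = np.array([[0, 1, 1, 5],
              [1, 0, 4, 2],
              [1, 4, 0, 3],
              [5, 2, 3, 0]], dtype=float)

d, f, E2 = theorem2_transform(D)
s_stress = np.trace((D * D - E2) @ (D * D - E2))   # tr(D*D - E*E)^2

# Coordinates in at most n-2 dimensions from the eigen-decomposition of tau(E*E).
lam, U = np.linalg.eigh(tau(E2))
order = np.argsort(lam)[::-1]                      # decreasing eigenvalues
lam, U = lam[order], U[:, order]
p = D.shape[0] - 2
X = U[:, :p] * np.sqrt(np.clip(lam[:p], 0.0, None))
print(d, f, s_stress)
```

The squared inter-point distances of `X` can be compared with `E2` to confirm that the configuration reproduces the transformed matrix.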

The results given by Theorems 1 and 2 are a non-trivial consequence of the
additive constant problem. More complex is the problem of finding a non-linear
and monotone transformation of the dissimilarities which is Euclidean and
minimizes (1a) or (1b). A first result is given for (1b), when a quadratic
mapping is used.

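Theorem 3: Let $D \in \mathcal{D}_n$, and let $\alpha$ and $\beta$ be the minimum eigenvalues of $-\tfrac{1}{2} Z(D*D)Z$
and $-\tfrac{1}{2} Z(D*D*D*D)Z$ respectively. The quadratic transformation of (D*D):

$$E*E = g(\mathbf{1}\mathbf{1}'-I) + h\, D*D + l\, D*D*D*D, \quad \text{with } g,h,l \ge 0, \qquad (6)$$

is such that E is Euclidean if

$$g,h \ge 0 \quad \text{and} \quad l \ge (-g/2 - \alpha h)/\beta \ge 0 \qquad (7)$$

and $\min\{\mathrm{tr}(D*D-E*E)^2 \mid D \in \mathcal{D}_n,\ -\tfrac{1}{2} Z(E*E)Z \ge 0,\ g,h \ge 0 \text{ and } l=(-g/2-\alpha h)/\beta \ge 0\}$,
i.e., the S-STRESS subject to constraints (7) is minimized for

$$g^* = \frac{\mathrm{tr}(D_2F)\,\mathrm{tr}G^2 - \mathrm{tr}(FG)\,\mathrm{tr}(D_2G)}{\mathrm{tr}F^2\,\mathrm{tr}G^2 - (\mathrm{tr}FG)^2}, \qquad h^* = \frac{\mathrm{tr}F^2\,\mathrm{tr}(D_2G) - \mathrm{tr}(FG)\,\mathrm{tr}(D_2F)}{\mathrm{tr}F^2\,\mathrm{tr}G^2 - (\mathrm{tr}FG)^2},$$

$$l = l^* = (-g^*/2 - \alpha h^*)/\beta, \qquad (8)$$

where $D_2 = D*D$; $F = (\mathbf{1}\mathbf{1}'-I) - (1/2\beta)\, D*D*D*D$; $G = D*D - (\alpha/\beta)\, D*D*D*D$.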

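Proof: Matrix $W = -\tfrac{1}{2} Z(E*E)Z$ has to be p.s.d., i.e. $x'Wx \ge 0\ \forall x \in \mathbb{R}^n$, with $x'x=1$.

$$x'Wx = \tfrac{1}{2} g\, x'Zx + h\, x'[-\tfrac{1}{2} Z(D*D)Z]x + l\, x'[-\tfrac{1}{2} Z(D*D*D*D)Z]x. \qquad (9)$$

Since $x'[-\tfrac{1}{2} Z(D*D)Z]x \ge \alpha\, x'x$, $x'[-\tfrac{1}{2} Z(D*D*D*D)Z]x \ge \beta\, x'x$ and
$x'Zx \ge 0$, for $g,h \ge 0$ we have $x'Wx \ge (\tfrac{1}{2}g + \alpha h + \beta l)\, x'Zx \ge 0$ if (7) holds.

Furthermore, writing $D_4 = D*D*D*D$,

$$H = \mathrm{tr}\{D_2 - g(\mathbf{1}\mathbf{1}'-I) - hD_2 - [(-g/2 - \alpha h)/\beta]D_4\}^2 =$$

$$= \mathrm{tr}D_2^2 + g^2\, \mathrm{tr}F^2 + h^2\, \mathrm{tr}G^2 - 2g\, \mathrm{tr}D_2F - 2h\, \mathrm{tr}D_2G + 2gh\, \mathrm{tr}FG. \qquad (10)$$

The system of normal equations is:

$$\begin{cases} \partial H/\partial g = 0: & g\, \mathrm{tr}F^2 + h\, \mathrm{tr}FG = \mathrm{tr}D_2F \\ \partial H/\partial h = 0: & g\, \mathrm{tr}FG + h\, \mathrm{tr}G^2 = \mathrm{tr}D_2G \end{cases} \qquad (11)$$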

from which (8) follows. ■ 
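The closed form (8) can be checked numerically with a short sketch. The function name and the 4 x 4 matrix below are hypothetical, $\tau(A) = -\tfrac{1}{2}ZAZ$ is assumed, and the sign conditions on g, h, l in (7) are not enforced here: the code only solves the two normal equations by Cramer's rule, so it illustrates the algebra rather than the full constrained problem.

```python
import numpy as np

def tau(A):
    """Double centring, assumed here to be tau(A) = -1/2 * Z A Z with Z = I - 11'/n."""
    n = A.shape[0]
    Z = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * Z @ A @ Z

def theorem3_coefficients(D):
    """Return (g*, h*, l*) from equation (8); sign constraints in (7) are NOT checked."""
    n = D.shape[0]
    D2 = D * D
    D4 = D2 * D2
    alpha = np.linalg.eigvalsh(tau(D2)).min()
    beta = np.linalg.eigvalsh(tau(D4)).min()     # must be non-zero for (8) to apply
    J = np.ones((n, n)) - np.eye(n)
    F = J - D4 / (2.0 * beta)
    G = D2 - (alpha / beta) * D4
    tFF, tGG, tFG = np.trace(F @ F), np.trace(G @ G), np.trace(F @ G)
    tDF, tDG = np.trace(D2 @ F), np.trace(D2 @ G)
    det = tFF * tGG - tFG**2
    g = (tDF * tGG - tFG * tDG) / det
    h = (tFF * tDG - tFG * tDF) / det
    l = (-g / 2.0 - alpha * h) / beta
    return g, h, l

# Hypothetical dissimilarities that violate the triangle inequality (1 + 1 < 4),
# so that both alpha and beta are strictly negative.
D = np.array([[0, 1, 1, 5],
              [1, 0, 4, 2],
              [1, 4, 0, 3],
              [5, 2, 3, 0]], dtype=float)

g, h, l = theorem3_coefficients(D)
print(g, h, l)
```

If any of the returned coefficients is negative, the constrained solution of Theorem 3 lies on the boundary of (7) and this unconstrained stationary point does not apply as such.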

It has to be noted that Theorem 3 finds the solution of a constrained problem on
the coefficients and not of the unconstrained one. In this case the configuration
of points is found in at most p=n-1 dimensions: $X=[u_1,\dots,u_{n-1}]\,\mathrm{diag}(\lambda_{(1)},\dots,\lambda_{(n-1)})^{1/2}$,
where $\lambda_{(1)},\dots,\lambda_{(n-1)}$ are the ranked eigenvalues of $\tau(g^*(\mathbf{1}\mathbf{1}'-I)+h^*D*D+l^*D*D*D*D)$
and $u_1,\dots,u_{n-1}$ are the corresponding eigenvectors. The optimal least squares
approximation (6) of the dissimilarity matrix D can be obtained by solving the
following quadratic constrained problem:

$$\begin{cases} \text{minimize } \mathrm{tr}(D*D - E*E)^2 \\ \text{subject to} \\ E*E = g(\mathbf{1}\mathbf{1}'-I) + h(D*D) + l(D*D*D*D) \\ \lambda_{(n-1)}(\tau(E*E)) = 0 \end{cases} \qquad (12)$$

where $\lambda_{(n-1)}(\tau(E*E))$ is the second smallest eigenvalue of $\tau(E*E)$. Problem (12)
has been solved using a Sequential Quadratic Programming (SQP) algorithm. 
Comparative studies of non linear programming algorithms indicate that the 
SQP algorithm performs very well in terms of successful solutions, with a 
superlinear rate of convergence. An overview of SQP methods is given in 
Powell (1983). The analytical solution g*, h*, I*, (8) can be used as initial guess 
for problem (12), and in the example we analyzed the rate of convergence was 
improved. Often the optimal solution of problem (12) is close to the analytical 
solution given by (8). 



3. Numerical Example 

The dissimilarity matrix D reported in Table 1 was generated from a uniform
distribution on [0, 10]. D is not Euclidean, since $\tau(D*D)$ has two large negative
eigenvalues (-25.8640, -9.5391).



Table 1 : Dissimilarity matrix D 



0       5.3685  0.5950  0.8896  2.7131  4.0907  4.7404
5.3685  0       3.2896  4.7819  5.9717  1.6145  8.2947
0.5950  3.2896  0       8.1212  6.1011  7.0149  0.9220
0.8896  4.7819  8.1212  0       8.3864  4.5161  9.5660
2.7131  5.9717  6.1011  8.3864  0       9.5169  6.4001
4.0907  1.6145  7.0149  4.5161  9.5169  0       6.1094
4.7404  8.2947  0.9220  9.5660  6.4001  6.1094  0



The best linear least squares approximation of matrix D in Table 1,
according to STRESS, is obtained by applying Theorem 1, which gives:

E = 3.8720(11'-I) + 0.3098D,
with STRESS = 75.9819.

The best linear least squares approximation of matrix D*D, according to
S-STRESS, is obtained by applying Theorem 2, which gives:

E*E = 23.6578(11'-I) + 0.4573D*D,
with S-STRESS = 5443.2.

The best least squares quadratic approximation of D*D constrained by (7),
according to S-STRESS, is obtained by applying Theorem 3, which gives:

E*E = 25.2658(11'-I) + 0.3067D*D + 0.0018D*D*D*D,
with S-STRESS = 5404.0.

Problem (12) was solved using the SQP algorithm after 153 function evaluations
with a convergence constant equal 10'’; 

E*E = 53.4127(11'-I) + 0.0102D*D*D*D,
with S-STRESS=3365.0. 

This last Euclidean matrix has two zero eigenvalues. 



4. Discussion 

In this paper non-metric full-MDS is discussed. Three solutions are given for the
cases where the relation between D, (D*D) and E, (E*E) is linear or quadratic.
We suggest the use of these solutions: (a) as initial guesses for iterative global
non-metric MDS procedures; or (b) before applying a metric MDS technique,
because in this case the configuration associated to E and obtained in
reduced dimensions by a metric MDS is often less distorted. This reduced distortion is
confirmed, for example, when a clustering technique is applied on D, (D*D) and
the solution is compared with the two obtained on E, (E*E) (identifying a
configuration in 2 dimensions) with or without the above transformations before
applying a metric MDS technique.

For example, Figure 2 (a) shows the dendrogram of the single linkage applied on the
square of the dissimilarities in Table 1. The dendrogram obtained by applying
the single linkage algorithm on the best least squares Euclidean approximation of
matrix D*D is shown in Figure 2 (b). It can be noted that the two dendrograms
(Figures 2 (a) and (b)) exhibit different classifications at different levels of fusion.
The dendrogram in Figure 2 (c), obtained by applying single linkage on the
quadratic approximation given by Theorem 3, presents the same topology as the
dendrogram in Figure 2 (a), with slightly different lengths of linkages. This
confirms that the approximation in case (6) produces a larger distortion in the
classification pattern observed in Table 1, with respect to a monotone
approximation.

Figure 2: Comparison among the single linkage dendrograms applied on: (a) the
square of dissimilarities in Table 1; (b) the best least squares Euclidean approximation of
D*D; (c) the quadratic Euclidean approximation of D*D.

(a) Single linkage applied on D*D, with D in Table 1
(b) Single linkage applied on the best least squares Euclidean approximation
(c) Single linkage applied on the quadratic Euclidean approximation of D*D:
E*E = 25.2658(11'-I) + 0.3067D*D + 0.0018D*D*D*D
Dynamic Factor Analysis

Abstract: This paper presents an extension of the Dynamic Factor Analysis
(AFD) models proposed in the '70s by Coppi and Zannella. AFD models are
specific to data arrays whose third dimension is time. They consider time as an
explicit element which gives rise to part of the observed variability. AFD
models integrate two different strategies. The first aims at studying the 
relationships between variables and units, averaged over time, by factorial 
analysis of specific covariance-matrices. The second aims at studying time 
evolution of both variables and units by time regression and autoregressive 
models. 

Key words : array of data or cubic matrices, factorial analysis, time series, 
regression and autoregressive models. 



1. Introduction 

Since the 1980s, growing attention has been given to the statistical
treatment of data classified according to the following three criteria (or modes,
cf. Tucker 1966): statistical unit, quantitative variable and time of data
collection. This kind of data may be represented in a cubic matrix X (Law and
others, 1984) whose generic element is $x_{ijt}$, where i=1,...,I is the unit index,
j=1,...,J is the variable index and t=1,...,T is the time index, when we observe
the same units and variables at each time (or occasion).

Different statistical models have been proposed for this kind of data. The 
Dynamic Factorial Analysis (AFD) models are one of these (Coppi, Zannella, 
1979). 

In this work, we will discuss the possible extensions of AFD regression models, 
used to describe the time variability of the array, to polynomial functions in t of 
order greater than 1 and non linear functions. The possibility of introducing 
autoregressive models will be considered. Finally parameter estimates will be 
discussed from the data reconstitution point of view for AFD model I. This 
allows us to join both the regression and the factorial strategies. 

In this context a descriptive approach has been used instead of a probabilistic 
one. 
2. Data preprocessing

The array X can be reduced to a two-dimensional matrix, depending on which of
the two indices i or t is used as the external one. In fact X can be reduced to a
matrix of dimensions IT x J by stacking the matrices {$X_t$, t=1,...,T}, where
$X_t$ is the units by variables matrix observed at time t, or X can be reduced to a
matrix of dimensions TI x J by stacking the matrices {$X_i$, i=1,...,I}, where
$X_i$ is the times by variables matrix for the unit i.

We can introduce weights for each mode of the array X. As regards units, we
consider the diagonal matrix D(I x I) = {$d_i$, i=1,...,I; $\sum_i d_i = 1$}. For the variables we
consider the diagonal matrix M(J x J) = {$m_j$, j=1,...,J}. In AFD we suggest as $m_j$
the reciprocal of the mean of the IT observations for variable j. The main aim of
weighting variables is to eliminate differences in measurement units and
character intensities, which could influence the analysis by modifying
comparisons between and/or within occasions. As regards time we consider the
diagonal matrix L(T x T) = {$l_t$, t=1,...,T; $\sum_t l_t = 1$}. Greater weights can be attributed
to central time values, and smaller weights to extreme time values of the period
of interest, following a moving-average approach as in time series analysis.
Each element $x_{ijt}$ is weighted by the quantity ($d_i \cdot m_j \cdot l_t$), which can be
considered as an element of an array P obtained as the tensor
product $(d \otimes m') \otimes l$, where d, m, l are the main diagonals of the matrices D, M, L
respectively. The total sum of the elements of P is the trace of M.
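The weighting scheme above can be illustrated with a small sketch. The array sizes, the random data and all variable names are hypothetical; the variable weights follow the suggestion in the text (reciprocal of the mean of the IT observations of each variable), and the array is stored with axes ordered (unit, variable, time) purely by assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_vars, n_times = 4, 3, 5            # I, J, T (hypothetical sizes)

# Hypothetical positive data array X with axes (unit i, variable j, time t).
X = rng.uniform(1.0, 10.0, size=(n_units, n_vars, n_times))

d = np.full(n_units, 1.0 / n_units)           # unit weights, sum to 1
l = np.array([0.1, 0.2, 0.4, 0.2, 0.1])       # time weights, larger for central times
m = 1.0 / X.mean(axis=(0, 2))                 # reciprocal of the mean over the IT observations

# Array of weights P with generic element d_i * m_j * l_t.
P = np.einsum("i,j,t->ijt", d, m, l)

print(P.shape, P.sum(), m.sum())              # P.sum() should equal trace(M) = sum of m_j
```

Since the unit weights and the time weights each sum to one, the total of P reduces to the sum of the $m_j$, that is, the trace of M.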



3. Sources of variation in X 

AFD considers three sources of variation which describe the observed data in 
the period of interest. 

The first one is the time evolution of each variable averaged over units. The
corresponding covariance matrix is $^*S_T$:

$$^*S_T = [(I - \mathbf{1}l')\, {}^*X_T]'\, L\, [(I - \mathbf{1}l')\, {}^*X_T]$$

where $^*X_T = \{x_{.jt},\ j=1,\dots,J,\ t=1,\dots,T\}$, and $x_{.jt} = \sum_i x_{ijt} \cdot d_i$.

The generic element of $^*S_T$ is $^*s_{jj'} = \sum_t (x_{.jt} - x_{.j.})(x_{.j't} - x_{.j'.}) \cdot l_t$.

The second source of variation considers the structural relationships between
units and variables, averaged over time.

The corresponding covariance matrix is $^*S_I$:

$$^*S_I = [(I - \mathbf{1}d')\, {}^*X_I]'\, D\, [(I - \mathbf{1}d')\, {}^*X_I]$$

where $^*X_I = \{x_{ij.},\ i=1,\dots,I,\ j=1,\dots,J\}$, and $x_{ij.} = \sum_t x_{ijt} \cdot l_t$.

The generic element of $^*S_I$ is $^*s_{jj'} = \sum_i (x_{ij.} - x_{.j.})(x_{ij'.} - x_{.j'.}) \cdot d_i$.

The third source of variation arises from the differential time evolution of the units,
resulting from the interaction of the two modes unit and time, without the
interactions of the other two-mode combinations, unit-variable and
variable-time. This variability is described by the covariance matrix $S_{IT}$ of the values

$(x_{ijt} - x_{ij.} - x_{.jt} + x_{.j.})$.

The three covariance matrices rise from the decomposition of the total 
covariance matrix S of the array X. 

Considering X as a two-dimensional matrix, S can be obtained as follows:

$$S = \{[I - \mathbf{1}(l \otimes d)']X\}'\, \mathrm{diag}(l \otimes d)\, \{[I - \mathbf{1}(l \otimes d)']X\}$$

when X results from the stacking of the matrices $X_t$ and $(l \otimes d) = (l_1 d',\ l_2 d',\ \dots,\ l_T d')'$;
I is the identity matrix of order IT, and 1 is a vector of IT 1's; or

$$S = \{[I - \mathbf{1}(d \otimes l)']X\}'\, \mathrm{diag}(d \otimes l)\, \{[I - \mathbf{1}(d \otimes l)']X\}$$

when X results from the stacking of the matrices $X_i$ and $(d \otimes l) = (d_1 l',\ d_2 l',\ \dots,\ d_I l')'$;
I is the identity matrix of order TI, and 1 is a vector of TI elements.

It can be shown that $S = {}^*S_I + {}^*S_T + S_{IT}$.

Finally we can define the following covariance matrices:

$\bar{S}_T = \sum_t S(t) \cdot l_t$, t=1,...,T, where S(t) is the covariance matrix of the observations
at time t;

$\bar{S}_I = \sum_i S(i) \cdot d_i$, i=1,...,I, where S(i) is the covariance matrix for the unit i,
considering times as observations.

It can be shown that $\bar{S}_T = {}^*S_I + S_{IT}$ and $\bar{S}_I = {}^*S_T + S_{IT}$. Looking at the decomposition
of S, we can write S in the following two ways: $S = \bar{S}_T + {}^*S_T$ or $S = \bar{S}_I + {}^*S_I$.

In AFD the three sources of variations are described using factorial methods and 
time regression models. 
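The decomposition of the total covariance matrix into the three sources of variation can be verified numerically. The sketch below is only illustrative: the function name, the array layout (unit, variable, time) and the random data are assumptions, the variable weights M are left out, and the bar notation for the averaged within-time matrix is likewise an assumption about the original symbols.

```python
import numpy as np

def afd_covariances(X, d, l):
    """X has axes (unit i, variable j, time t); d and l are weights summing to 1.
    Returns total, between-units, between-times and interaction covariance matrices."""
    x_jt = np.einsum("ijt,i->jt", X, d)          # centres over units, x_.jt
    x_ij = np.einsum("ijt,t->ij", X, l)          # centres over time,  x_ij.
    x_j = x_jt @ l                               # overall centres,    x_.j.

    dev = X - x_j[None, :, None]
    S = np.einsum("ijt,ikt,i,t->jk", dev, dev, d, l)

    a = x_ij - x_j[None, :]                      # unit effect
    S_I = np.einsum("ij,ik,i->jk", a, a, d)

    b = x_jt - x_j[:, None]                      # time effect
    S_T = np.einsum("jt,kt,t->jk", b, b, l)

    c = X - x_ij[:, :, None] - x_jt[None, :, :] + x_j[None, :, None]
    S_IT = np.einsum("ijt,ikt,i,t->jk", c, c, d, l)
    return S, S_I, S_T, S_IT

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3, 5))                   # hypothetical I=6, J=3, T=5
d = rng.dirichlet(np.ones(6))                    # unit weights
l = rng.dirichlet(np.ones(5))                    # time weights

S, S_I, S_T, S_IT = afd_covariances(X, d, l)

# Average over time of the within-time covariance matrices S(t).
within_t = X - np.einsum("ijt,i->jt", X, d)[None, :, :]
S_bar_T = np.einsum("ijt,ikt,i,t->jk", within_t, within_t, d, l)

print(np.allclose(S, S_I + S_T + S_IT), np.allclose(S_bar_T, S_I + S_IT))
```

The cross terms vanish because the weighted unit effects, time effects and interaction residuals each sum to zero over the mode they are centred on.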
4. AFD models

The first three AFD models consider different strategies to analyse the "static"
variability and the differential time evolution of the units.

As regards the time evolution of the centres $x_{.jt}$, all three AFD models
consider a linear regression model for each variable j, where the independent
variable is time. The parameters are obtained by ordinary least squares. The
assumptions about residuals are the classic ones: $\mathrm{cov}[e_{.jt}, e_{.j't'}] = w_j$ if j=j' and t=t',
and 0 otherwise.

The variability of the centres $x_{ij.}$ is analysed by factorial analysis of specific
covariance matrices in each of the three AFD models.

In the first model, factorial analysis is applied to the covariance matrix $\bar{S}_T$. By
projecting the matrices $X_t$ centred at each time we obtain the factorial
representation of each unit at each time. The representation of the centres $x_{ij.}$ is
obtained by projecting the centred matrix $^*X_I$ on the factorial plane, recalling
the decomposition $\bar{S}_T = {}^*S_I + S_{IT}$.

In the second AFD model, factorial analysis is applied to the matrix $^*S_I$. In this
way we obtain the factorial representation of the centres $x_{ij.}$.

In the third model, factorial analysis is applied to the matrix $_y\bar{S}_T = \sum_t\, {}_yS(t) \cdot l_t$,
where $_y\bar{S}_T$ and $_yS(t)$ are obtained as in the first model, but considering the observed
data processed as follows: $y_{ijt} = x_{ijt} - \hat{x}_{ijt}$, where $\hat{x}_{ijt}$ is the time regression value
corresponding to $x_{ijt}$, describing the total time evolution of unit i for each
variable j.

As regards the differential time evolution of the units, in the first model it is
described by comparing the projection of each unit at each time with the
projection of the corresponding centre $x_{ij.}$:




where $F_{ith}$ and $F_{ih}$ are the factorial scores corresponding to the unit i on the
factor h; $c_h$, h=1,...,H, is the eigenvector corresponding to the h-th eigenvalue of
$\bar{S}_T$. This representation arises from the decomposition $\bar{S}_T = {}^*S_I + S_{IT}$.

Both the second and the third AFD models describe the differential time
evolution of the units starting from a time regression model for each unit,
whose parameters are calculated by ordinary least squares:

$$x_{ijt} = a_{ij} + b_{ij} \cdot t + e_{ijt}, \quad j=1,\dots,J \text{ and } i=1,\dots,I$$

the assumptions are: $\mathrm{cov}[e_{ijt}, e_{ij't'}] = w_j$ if j=j' and t=t'; 0 otherwise.

The differential time evolution of each unit can be measured by considering the
differences between the two regression parameters $b_j$ and $b_{ij}$, j=1,...,J.

The different strategies are based on the two possible decompositions of the
matrix of total covariances S: $S = \bar{S}_T + {}^*S_T$ and $S = \bar{S}_I + {}^*S_I$. Model I considers the
first decomposition; both models II and III consider the second decomposition,
but the third model uses the following approximation: $_y\bar{S}_T \cong {}^*S_I$.



5. Indices of the quality of fit of the model to data

For each source of variation, indices measuring the quality of fit of the
model to data are calculated. As the models are linear, the indices are calculated
considering the trace of both the observed and the theoretical covariance matrices.

As regards the centres $x_{.jt}$, the quality index is the same for the three models:

$$^*I_T = \mathrm{tr}(^*\hat{S}_T)/\mathrm{tr}(^*S_T)$$

where $^*\hat{S}_T$ is the covariance matrix of the regression values of $x_{.jt}$.

As regards factorial structures, the indices for the first model are the following:

$$\bar{I}_T = [\textstyle\sum_h c_h' \cdot \bar{S}_T \cdot c_h]/\mathrm{tr}(\bar{S}_T)$$

where $c_h$ is the eigenvector corresponding to the h-th eigenvalue of $\bar{S}_T$.

The following index measures how much variability is described at each time
occasion by the factorial plane:

$$I(t) = [\textstyle\sum_h c_h' \cdot S(t) \cdot c_h]/\mathrm{tr}[S(t)] \quad \text{with } t=1,\dots,T$$

As regards the centres $x_{ij.}$ the corresponding quality index is:

$$^*I_I = [\textstyle\sum_h c_h' \cdot {}^*S_I \cdot c_h]/\mathrm{tr}(^*S_I)$$

Finally the quality index of the differential time evolution of the units is:

$$I_{IT} = [\textstyle\sum_h c_h' \cdot S_{IT} \cdot c_h]/\mathrm{tr}(S_{IT})$$

In the second model the quality index of the factorial structure is:

$$^*I_I = [\textstyle\sum_h {}^*c_h' \cdot {}^*S_I \cdot {}^*c_h]/\mathrm{tr}(^*S_I)$$

In the third model the index is:

As regards the differential time evolution of the units in the second and the third
model, the quality index is:

$$I_{IT} = \mathrm{tr}(\hat{S}_{IT})/\mathrm{tr}(S_{IT})$$

where $\hat{S}_{IT}$ is the covariance matrix of the values $\hat{x}_{ijt} - \hat{x}_{.jt}$ resulting from the
regressions. The index corresponding to the total time evolution of the units is:

$$\bar{I}_I = \mathrm{tr}(\hat{\bar{S}}_I)/\mathrm{tr}(\bar{S}_I)$$

where $\hat{\bar{S}}_I$ is the theoretical covariance matrix, calculated as the mean over units
of the theoretical covariance matrices of the regression values of each unit. We
can consider the same kind of index for each unit:

$$I(i) = \mathrm{tr}\,\hat{S}(i)/\mathrm{tr}[S(i)], \quad i=1,\dots,I$$

where $\hat{S}(i)$ is the covariance matrix of the regression values of each unit,
considering times as the observations.

The summary indices of the total fit of the models to data are:

Model I: $I = [\bar{I}_T \cdot \mathrm{tr}(\bar{S}_T) + {}^*I_T \cdot \mathrm{tr}(^*S_T)]/\mathrm{tr}(S)$

Model II: $I = [\bar{I}_I \cdot \mathrm{tr}(\bar{S}_I) + {}^*I_I \cdot \mathrm{tr}(^*S_I)]/\mathrm{tr}(S)$

Model III: $I = [{}^*I_T \cdot \mathrm{tr}(^*S_T) + {}_y\bar{I}_T \cdot \mathrm{tr}(_y\bar{S}_T) + I_{IT} \cdot \mathrm{tr}(S_{IT})]/\mathrm{tr}(S)$

6. Possible extensions of time regression models 

It can be easily proved that:

1. the indices $^*I_T$ and $\bar{I}_I$ attain their maximum over $B_T$ and $B_{IT}$, where $B_T$ and $B_{IT}$ represent the classes of
all linear estimators respectively for the regression models $x_{.jt} = a_j + b_j \cdot t + e_{.jt}$ and
$x_{ijt} = a_{ij} + b_{ij} \cdot t + e_{ijt}$, i=1,...,I, j=1,...,J;

2. $b_j = \sum_i b_{ij} \cdot d_i$.

As regards the time evolution of the centres $x_{.jt}$ and, for models II and III, the
time evolution of each unit, we can consider more general models, such as
$x_{.jt} = {}_jf(t) + e_{.jt}$ and $x_{ijt} = {}_{ij}f(t) + e_{ijt}$, where $_jf(t)$ and $_{ij}f(t)$ are generic polynomial or
non-linear functions of time, corresponding to variable j. The assumptions about
$e_{.jt}$ and $e_{ijt}$ are the same as in the case of the linear regression. In a descriptive
context, the parameters can be calculated by ordinary least squares, whose
solution can be obtained by numerical methods.

Considering $_jf(t)$ and $_{ij}f(t)$ as polynomial functions in t of the same order, the
vectors of the parameters have the same properties we saw in the case of the
simple linear regression:

1. the indices of fit are maximum;

2. the parameters for $x_{.jt}$ are obtained as the mean over i of the parameters of $x_{ijt}$.
The differential time evolution of the units can be measured by the parameters
$_{ij}b - {}_jb$, easily calculated considering centred data.

These considerations are valid with respect to more general functions, linear in
the parameters:

$$x_{.jt} = {}_jb_0 + {}_jb_1 \cdot g_1(t) + {}_jb_2 \cdot g_2(t) + \dots + {}_jb_k \cdot g_k(t) + e_{.jt}$$

$$x_{ijt} = {}_{ij}b_0 + {}_{ij}b_1 \cdot g_1(t) + {}_{ij}b_2 \cdot g_2(t) + \dots + {}_{ij}b_k \cdot g_k(t) + e_{ijt}$$
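Property 2 (the parameters of the centres are the weighted mean over units of the unit-level parameters) follows from the linearity of least squares, and can be checked with a small sketch. The data, the sizes and the quadratic basis $1, t, t^2$ are hypothetical choices made only for illustration; time weights are not used in the fit, which is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_vars, n_times, degree = 5, 3, 8, 2     # hypothetical I, J, T and polynomial order

X = rng.normal(size=(n_units, n_vars, n_times))   # axes (unit i, variable j, time t)
d = rng.dirichlet(np.ones(n_units))               # unit weights summing to 1

t = np.arange(1, n_times + 1, dtype=float)
basis = np.vander(t, degree + 1, increasing=True) # columns g_0=1, g_1=t, g_2=t^2

# Ordinary least squares for every (unit, variable) series x_ijt.
series = X.reshape(n_units * n_vars, n_times).T   # one column per (i, j) pair
coef_units = np.linalg.lstsq(basis, series, rcond=None)[0]
coef_units = coef_units.T.reshape(n_units, n_vars, degree + 1)

# Ordinary least squares for the centres x_.jt (weighted mean over units).
centres = np.einsum("ijt,i->jt", X, d)
coef_centres = np.linalg.lstsq(basis, centres.T, rcond=None)[0].T

# Weighted mean over units of the unit-level coefficients.
coef_mean = np.einsum("ijk,i->jk", coef_units, d)

# Differential time evolution: unit-level coefficients minus those of the centres.
differential = coef_units - coef_centres[None, :, :]
print(np.allclose(coef_centres, coef_mean))
```

The same check goes through for any set of functions $g_1(t),\dots,g_k(t)$, provided the same design matrix is used for the units and for the centres.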



When there are enough occasions, it is possible to introduce an autoregressive
model for each variable, fitted directly to the raw data if they exhibit stationarity, or
to the data after removal of the trend, which could be estimated by a linear
time regression. The important aspect is the study of the time structure of the
model as a supplementary source of information (Piccolo, 1974).



7. Reconstitution of the array X 

We can reconstitute the initial data when the model parameters are found by
minimizing a loss function which measures the fit of the model to the observed
data. Let us consider the following loss function: $\sum_{ijt} (x_{ijt} - \hat{x}_{ijt})^2$ (2)

$x_{ijt}$ can be written as $x_{ijt} = (x_{ijt} - x_{.jt} - x_{ij.} + x_{.j.}) + (x_{ij.} - x_{.j.}) + x_{.jt}$ (3)

where each part represents a different source of variation, and is parameterized
in different ways in each AFD model.

In AFD model I the first two elements of (3) are described by a factorial model,
the third one by a time regression. It can be proved that minimizing (2) by
considering the appropriate expression for $\hat{x}_{ijt}$ gives us the same parameters for
the factorial representation and the regression model as indicated above for the
model itself. We can reconstitute the observed data as $x_{ijt} = \sum_h F_{ith}\, a_{jh} + \hat{x}_{.jt}$,
where $\sum_h F_{ith}\, a_{jh}$ is the factorial representation of $\bar{S}_T$, and $\hat{x}_{.jt}$ is the time
regression value for $x_{.jt}$. Similar considerations can be made for the other two
AFD models, with more complications due to the more complex model
structures of the various sources of variation.



8. Conclusions 

AFD models represent an alternative to three-mode data analysis models such
as STATIS and Tucker's models. AFD models seem to be more
convenient for three-mode data whose third dimension is time, because they
consider time as an explicit element which gives rise to part of the observed
variability in the array X. In this sense, time is considered as a dimension of
a different nature with respect to the other two dimensions, the unit and the
variable.

Parameters from time regression models and considerations about time structure 
from autoregressive models considerably enrich the information about the 
relationships between units and variables given by factorial models. 
Abstract: The paper provides a contribution to factorial methods in
multidimensional data analysis covering the gap of graphical representations of 
statistical units on which a multiple set of response variables as well as a 
common set of explanatory variables are observed. By joining the features of 
multiple Co-Inertia analysis with those of a geometrical non-symmetrical 
approach, the proposed technique gains remarkable advantages in identifying a 
typology of statistical units generated by the mentioned dependence structure. 

Keywords: Multiple Sets, Co-Inertia, Orthogonal Projections, Graphical Displays. 



1. Introduction 

The prediction of multivariate responses by multivariate predictors is a relevant 
problem in applied Statistics. In real applications, a “true model” for the link 
between responses and predictors is seldom available and, in any case, does not 
provide a graphical insight for a better understanding of the data structure under 
study. Therefore, a geometrical approach seems more reasonable and helpful. 

In this paper, we aim at visually inspecting the dependence structure between K 
sets of response variables with respect to a common set of explanatory 
variables. This approach helps in graphically identifying a typology of the 
statistical units induced by those sets of response variables with the highest 
discrimination power among them. 

The usefulness of this technique in real applications is of primary concern to 
quality control problems, as it will be shown by an example on water pollution. 



2. The Data Structure 

The data structure at hand is similar to the typical one of Generalised Canonical
Correlation Analysis (GCCA, Carroll 1968). In fact, it comprises K sets of
$q_j$ (j=1,...,K) quantitative response variables observed on the same n statistical
units, represented by the K matrices $Y_j$ of dimensions $(n \times q_j)$. Therefore, the $Y_j$'s
are row-wise paired matrices whose rows belong to different multidimensional
spaces.

Moreover, as we refer to a dependence structure, i.e. we work in a non- 
symmetrical context, we have one set of p quantitative explanatory variables 
observed on the same n units, represented by the X matrix of dimensions (nxp). 
All variables are considered preliminarily centred with respect to their own 
arithmetic means as well as standardised to unit variance. As we work on pre- 
processed data, in the analysis we can refer to classical Euclidean metrics for 
interpreting graphical representations. 

In such a context, three different strategies of analysis are possible: 

1) a global analysis searching for a common plane where to compare the 
different matrices; 

2) a synthesis analysis searching for a compromise matrix whose factorial 
representation aims at defining homogeneous classes of the statistical units; 

3) a detailed analysis which, by means of the projections in supplementary of 
each matrix on the factorial planes of the synthesis analysis, allows to compare 
both statistical units and variables in the different matrices. 

In literature, several techniques (e.g. STATIS, Lavit 1988; Analyse Factorielle 
Multiple, Escofier & Pages 1984; Multiple Co-Inertia Analysis, Chessel & 
Hanafi 1996) have been proposed for the analysis of multiple tables but they 
refer only to the symmetrical structure of interdependence. 

In the following, our technique is developed in the direction of Multiple Co- 
Inertia Analysis but in a non-symmetrical context. 



3. The Analysis 

Geometrically speaking, Multiple Co-Inertia Analysis studies the response
variables structure as K clouds of n points in the spaces $R^{q_j}$ regardless of any
explanatory variables. On the other hand, Non Symmetrical Comparative
Analysis of Co-Inertia (Esposito 1997; Balbi & Esposito, 1997) takes into
account the explanatory variables but is confined to dealing with just 2 tables,
which must also be totally paired, i.e. the same response variables observed under 2
different conditions. Hereinafter, we aim at dealing with multiple sets
(K > 2) of response variables as well as with row-wise paired tables.

The analysis is set in the geometrical framework of Principal Component
Analysis onto a Reference subspace (PCAR, D'Ambra & Lauro, 1982). PCAR
looks for the principal components of the image of the response variables on the
subspace (in our case, $R^p$) spanned by the explanatory ones in order to take
into account the non-symmetrical relationship between two sets of variables.

Thus, we project the $Y_j$'s on the space $R^p$ through the common orthogonal
operator $P_X = X(X'X)^{-1}X'$ so as to define the new matrices $Y_j^* = P_X Y_j$.
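The projection step can be sketched as follows. The data are random and purely hypothetical, as are the variable names; the sketch only shows how a common orthogonal projector onto the column space of X may be built and applied to each response table, and it checks the defining properties of such a projector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3                                   # hypothetical numbers of units and predictors

def standardise(A):
    """Centre each column and scale it to unit variance, as assumed in the text."""
    return (A - A.mean(axis=0)) / A.std(axis=0)

X = standardise(rng.normal(size=(n, p)))                          # explanatory variables
Ys = [standardise(rng.normal(size=(n, q))) for q in (2, 4, 3)]    # K = 3 response sets

# Orthogonal projector onto the subspace spanned by the columns of X.
P_X = X @ np.linalg.solve(X.T @ X, X.T)

# Images of the response tables on that subspace.
Y_stars = [P_X @ Y for Y in Ys]

print(np.allclose(P_X @ P_X, P_X), np.allclose(P_X, P_X.T))
```

Using `np.linalg.solve` rather than an explicit inverse is a numerical convenience; it assumes that X has full column rank.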
In the core of the analysis we first search for K block-normalised vectors $w_j^1$,
one for each subspace spanned by the variables in $Y_j$, as well as for one
normalised auxiliary variable $z^1$ in the statistical units subspace $R^n$. After that,
vectors $w_j^2$ and $z^2$ are searched to be orthogonal to, respectively, the $w_j^1$'s and
$z^1$. The choice of these vectors is based on the maximisation of:

$$\sum_{j=1}^{K} \pi_j\, \mathrm{cov}^2(P_X Y_j w_j, z) \qquad (1)$$

where each $\pi_j$ represents the weight assigned to each $Y_j$. Such a weighting
system is necessary in order to take into account the different number of
response variables in each $Y_j$ (the choice used in the application) or the
different variability of each response variables set (i.e., $\pi_j = \mathrm{Var}\,Y_j / \sum_j \mathrm{Var}\,Y_j$).

The quantity in (1), as Var(z) = 1, may be differently expressed as:

$$\sum_{j=1}^{K} \pi_j\, \mathrm{Var}(P_X Y_j w_j)\, \rho^2(P_X Y_j w_j, z) \qquad (2)$$

where Var( ) stands for the variance operator and $\rho$( ) for the correlation
coefficient operator.

The equivalence in (2) points out that the proposed analysis jointly maximises
the criteria of, respectively, PCAR (the term $\mathrm{Var}(P_X Y_j w_j)$) and GCCA (the
term $\rho^2(P_X Y_j w_j, z)$). Consequently, it succeeds in performing K separate
analyses together with a global one of the $Y_j$'s.

By referring to the Cauchy-Schwarz inequality (Chessel & Hanafi, 1996), it can
be easily shown that the majoring quantity of the expression in (2) is defined by:

$$\sum_{j=1}^{K} \pi_j\, \|Y_j' P_X z\|^2 = z' \Big( \sum_{j=1}^{K} \pi_j P_X Y_j Y_j' P_X \Big) z \qquad (3)$$

As the $Y_j$'s are paired by rows, the first order solutions $w_j^1$ and $z^1$ are derived
from a PCA performed on the pooled matrix $Y^* = [Y_1^* | Y_2^* | \dots | Y_K^*]$.
Namely, the axes $w_j^1$ are the $q_j$-dimensional vectors of the principal axis
associated to the highest eigenvalue of a PCAR on the pooled K sets $Y_j$ with
respect to X, while the variable $z^1$ is given by the relative principal component.
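A first-order solution of this kind can be sketched with a singular value decomposition of the weighted pooled matrix. Everything below is an illustration under assumptions: the data are random, the function name is hypothetical, equal weights are used, z is normalised to unit length rather than unit variance, and the 1/n factors of the covariances are omitted, so the criterion is reproduced only up to those constants.

```python
import numpy as np

def first_order_solution(Y_stars, weights):
    """Return block-normalised axes w_j, auxiliary variable z, the attained
    criterion sum_j pi_j (w_j' Y*_j' z)^2 and the leading squared singular value."""
    pooled = np.hstack([np.sqrt(pi) * Y for pi, Y in zip(weights, Y_stars)])
    U, s, _ = np.linalg.svd(pooled, full_matrices=False)
    z = U[:, 0]                                   # unit-norm auxiliary variable
    axes = []
    for Y in Y_stars:
        w = Y.T @ z
        axes.append(w / np.linalg.norm(w))        # each block normalised to length 1
    criterion = sum(pi * float(w @ Y.T @ z) ** 2
                    for pi, Y, w in zip(weights, Y_stars, axes))
    return axes, z, criterion, float(s[0] ** 2)

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
P_X = X @ np.linalg.solve(X.T @ X, X.T)           # projector onto the columns of X

Ys = [rng.normal(size=(n, q)) for q in (3, 2, 5)] # hypothetical K = 3 response sets
Y_stars = [P_X @ (Y - Y.mean(axis=0)) for Y in Ys]
weights = [1.0 / 3.0] * 3                         # equal weights, an arbitrary choice

axes, z, criterion, top = first_order_solution(Y_stars, weights)
print(criterion, top)
```

For a fixed z the best block-normalised axis in each table is proportional to $Y_j^{*\prime} z$, which is why the attained criterion coincides with the leading eigenvalue of the weighted sum in (3).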



With respect to the second order solutions $w_j^2$ and $z^2$, we work on the residual
matrices $E_j = Y_j^* - Y_j^* P_{w_j^1}$, where $P_{w_j^1}$ is the orthogonal projection operator on
the subspace spanned by the vector $w_j^1$.

By juxtaposing all the $E_j$'s, we define a pooled matrix $E = [E_1 | E_2 | \dots | E_K]$.

Finally, the first order solutions of E represent the second order solutions of Y.
Similarly, by iterating the procedure, the generic r-th order solution may be 
found. 

The proposed technique provides one system of Co-Inertia axes on which to
project the K clouds of statistical units relative to each matrix $Y_j^*$ in $R^{q_j}$. These
representations, in accordance with the chosen maximisation criterion, allow a
global comparison of the K configurations of the statistical units. Consequently,
peculiar configurations of the units may be inspected.

On the same system, we can display the axes of K separate PCAR's with the
objective of representing the inertia structure of each table itself.

More importantly, aiming at enhancing a typology of the units, we normalise all
sets of co-ordinates to 1. In this way, we are allowed to simultaneously project,
on a common display, both the K representations of each unit and the relative
components of the auxiliary variables. Thereafter, star-plots are drawn which
characterise the K behaviours of each unit with respect to a synthesis of them.
With regard to the variables, both explanatory and response ones are
represented on the plane spanned by the auxiliary variables in $R^n$ in order to
show the links between the K sets $Y_j$ as well as the influence of the variables in
X on them.



4. An Application on Water Quality Control Multivariate Data 

The proposed technique is very useful for visualising multivariate quality control
problems (see Lauro, Scepi & Balbi, 1996, for a review). In particular,
it helps in identifying both a control typology and the causes of any
out-of-control situations. The latter is a very important topic in the analysis of multivariate
control processes, in which it is very difficult to detect the variables actually
determining out-of-control situations.

In the following example, we refer to the well-known data by Verneaux (1973)
relative to a study in hydrobiology. The aim is to investigate the water quality of
the French River Doubs taking into account both its ichthyological fauna and its
chemical, physical and biological components.

Therefore, two data matrices are available. The first one crosses 35 sites
observed along the river with 11 variables (explanatory ones) referring to
morphological features of the river (distance from the source, altitude, slope,
minimum mean flow) and to water quality indicators (pH, total hardness,
phosphates, nitrates, ammoniacal nitrogen, oxygen, Biological Oxygen
Demand). The second one crosses the same sites with an index of presence of 27
species (response variables) of fish at each site. These data have been
analysed by Chessel et al. (1987) in the framework of Canonical
Correspondence Analysis. Their analysis partitions the 27 species into 8
groups. In our analysis, we take into account their partition. However, as there
are two groups each formed by just one species, we first aggregate these groups
to the nearest groups on the factorial plane therein obtained. We thus finally
consider a partition of the species into 6 groups.

By applying our technique to the same data, taking into account the results by 
Chessel et al. (1987), our added value consists in identifying a common 
structure of the sites so as to detect “anomalous” situations. 

Figure 1 represents the discrimination power of each group of species, that is its 
capacity in yielding a typology of the sites. This capacity is computed, for each 
axis, as the squared covariance between each system of co-ordinates and the 
auxiliary variable relative to the same order. 
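
The computation of the discrimination power can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names and the toy values are hypothetical, and the population covariance (divisor n) is assumed.

```python
def covariance(u, v):
    """Population covariance of two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n


def discrimination_power(group_coords, auxiliary):
    """Squared covariance, axis by axis, between one group's site
    coordinates and the auxiliary variable of the same order.

    group_coords[a] and auxiliary[a] hold the values for axis a,
    one value per site.
    """
    return [covariance(c, z) ** 2 for c, z in zip(group_coords, auxiliary)]


# Hypothetical toy values: 4 sites, 2 axes (not the Doubs data).
coords_group = [[-1.0, 0.0, 0.5, 0.5], [0.2, -0.2, 0.1, -0.1]]
aux_variables = [[-2.0, 0.0, 1.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
power = discrimination_power(coords_group, aux_variables)
```

Comparing the values obtained for the different groups, axis by axis, gives the kind of ranking displayed in Figure 1.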

It is clearly shown that group 3 is the one with the highest overall discrimination 
power. For this reason, in the following, we show some of the graphical 
representations relative to group 3 in order to explain the interpretative power of 
the technique we propose. 

Figure 2 refers to the representation of sites in the species space in group 3. The
axes, displayed as full lines, are the ones obtained by performing a single PCAR
on group 3. They almost coincide with the principal axes of GCCA (dotted lines).



Figure 2: Representation of Sites in the space relative to Group 3



The first axis (the horizontal one) discriminates between the sites nearest to the
source (on the left) and those farthest from it (on the right). In fact,
the sites are numbered from s1 to s35 on the basis of their distance from the source.



Figure 3: Explanatory Variables and Group 3 Species

In order to understand the meaning of the second axis (the vertical one), we
must interpret the display in Figure 3, which represents both the explanatory
variables and the 9 species (long words in lower-case) in Group 3.

The first axis opposes altitude to distance, as they are logically inversely related,
and discriminates among the other morphological features (slope, hardness, and
flow) related to the position of the observed sites. The grouping of the species
on the right-hand side of this axis implies that their presence is strongly
influenced by the distance from the source. They are indeed species that usually
live very far from the source.

The second axis, instead, discriminates among the chemical components of the
water, thus representing a synthesis of its quality. In particular, species suited
to waters rich in nitrogen, phosphates and BRO are opposed to those that prefer
oxygen, pH and nitrates.

In Figure 4, the star-plots of the behaviour of just the ‘anomalous’ sites in the
different groups are represented. The centre of each star relates to the synthesis
representation while its edges relate to the behaviours in the different groups.
Moreover, by ‘anomalous’ sites we mean those which do not strictly conform
to a common behaviour around the origin, as most of the sites do.




The sites very far from the source have synthesis configurations that are very
similar to each other in shape and that altogether form a configuration
substantially different from that of the other sites. However, each of them
has a very different variability: e.g. s32 has a low variability, since its
representations are very close to each other, while s35 has a very high
variability due to its different behaviours in groups 1, 2 and 4.

On the other side of the first axis, it is worth noting that s1 has the highest
variability among all sites, and its synthesis may be considered quite anomalous
due to its behaviour in group 1.

The positioning of the other sites on the factorial plane may be similarly commented. 



5. Conclusions

The proposed technique represents a contribution to the non-symmetrical
approach to the simultaneous analysis of multiple data sets. This area has not
yet received the attention that its relevance in real applications deserves.
In this perspective, it is important to note the attention the technique pays to
the role played by the statistical units. Graphical representations of their
configurations, both in each set and as a whole, allow a better understanding of
the data at hand.

The approach we follow is entirely geometrical. The enrichment of the results 
by means of inferential tools represents a future task of research to be 
accomplished. Moreover, the idea behind this paper may turn out to be very 
profitable for studying complex covariance structures via a modelling approach. 

Abstract: In this paper we present a non-parametric adaptive procedure for the
non-linear representation of principal components (Principal Surfaces, PS,
LeBlanc and Tibshirani, 1994) in Constrained Principal Component Analysis
(CPCA, D'Ambra & Lauro 1982, 1992; see also Principal Component Analysis
with Instrumental Variables, Rao 1964; Redundancy Analysis, van den
Wollenberg, 1977).

Keywords: Constrained Principal Component Analysis, Principal Surfaces, 
Multivariate Adaptive Splines. 



1. Introduction 

In applied multivariate statistical analysis, dimension reduction is an important
problem related to obtaining an appropriate and easy representation.

To make data visualization and interpretation simpler, in this
paper we develop a generalization of Constrained Principal Component
Analysis (CPCA), where we use Principal Surfaces (PS) to relax the
linearity assumption of the final representation. This means that the non-linearity
does not concern the original data matrix as in Non-linear PCA (Gifi 1990), but
the component scores matrix (LeBlanc and Tibshirani, 1994).

The principal components are transformed by multivariate B-splines (PS) of
degree zero or one rather than by higher-degree B-splines, as the latter are
heavy to compute and difficult to interpret. In particular, crisp coding often
generates well separated and contiguous representations that are easy to
interpret (van Rijckevorsel 1987).

In the next sections, we first introduce the PS construction in the CPCA context;
afterwards we present the computational procedure for performing
Principal Surface Constrained Analysis (PSCA). This is based on a forward
phase for the choice of the optimal knot sequence of the univariate spline (for a
fixed number of knots) and on a backward phase for the optimal detection of the
space dimension.



2. Adaptive Principal Constrained Surfaces

Studying the dependence structure between two sets of quantitative variables,
the predictors and the responses $Y$, with n observations, q predictors and
p response variables (p < q), the non-linear generalization of CPCA is here
developed by estimating the generalized factor model:

$$y_j = f(c_j) + \varepsilon_j, \qquad j = 1, \dots, p \qquad (1)$$

where $y_j$ is the j-th column of the response matrix projected onto the subspace
of the predictors; $c_j$ (the j-th column of the component score matrix C) is the
parameter vector; and each $f(c_j)$ is a non-linear function with values in $R^n$.

We minimize the expected squared error:

$$E \, \| y_j - f(c_j) \|^2 \qquad (2)$$



Hastie & Stuetzle (1989) describe the Principal Curve as a non-linear
transformation of a principal component. Furthermore, they show that a
principal curve is a critical point of the squared distance function (2); in this
sense it generalizes the minimum distance property of linear principal
components. LeBlanc & Tibshirani use the tensor product among principal
curves and construct Principal Surfaces as multivariate splines.

Consistently, we use B-spline functions for the $f(c_j)$ because of their known
flexibility, and prefer B-splines of degree zero or one (PS), because they are
easier to interpret and compute.

So our problem becomes the optimal estimation of the parameters of the
following transformation:

$$f(C) = \beta_0 + \sum_{m=1}^{M} \beta_m \prod_{k=1}^{K_m} b_{k,m}\left(c_{j(k,m)}\right) \qquad (3)$$

where: $\beta_0$ and $\beta_m$ are the coefficients for the construction of the
spline transformation; M is the number of basis functions of the multivariate
spline; $K_m$ is the number of univariate spline basis functions used to construct
the tensors; and $b_{k,m}$ is the basis function of the generic component
$c_{j(k,m)}$, on the partition of the multivariate domain, used in the k-th term
of the product.

In our CPCA context we obtain for each response variable a PS on which units
are projected. These PS allow a suitable non-linear representation of the
non-symmetric relationship among responses and predictors, generating well
separated and contiguous representations. In the classical representation it
might be more difficult to inspect and interpret similarities and dissimilarities
among units. For example, units apparently located near each other in the
classical bi-dimensional plot could belong to different areas of the interval grid.
Using zero-degree B-splines, the PS are parallelepipeds. The grid levels are the
coefficients, computed by regressing Y on the non-linear
component tensor (i.e. the PS, the multivariate spline). Different groups of units
appear on different grid levels for each variable surface. In a way, we classify
units according to a specified variable: the higher the grid level, the stronger
the influence of the variable on the units.
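
The crisp (zero-degree B-spline) coding of a single component can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the toy scores and the convention that a score equal to a knot falls in the interval on its right are assumptions.

```python
from bisect import bisect_right


def crisp_code(scores, internal_knots):
    """Zero-degree B-spline (crisp) coding of one component.

    Each score is mapped to a 0/1 indicator row with one column per
    interval defined by the sorted internal knots; a score equal to a
    knot is assigned to the interval on its right (an assumption).
    """
    knots = sorted(internal_knots)
    rows = []
    for value in scores:
        interval = bisect_right(knots, value)
        row = [0] * (len(knots) + 1)
        row[interval] = 1
        rows.append(row)
    return rows


# Hypothetical component scores and a single internal knot.
coded = crisp_code([-1.5, -0.8, 0.1, 2.0], internal_knots=[-0.72])
```

Each unit then belongs to exactly one interval of the component, which is what produces the separated and contiguous grid areas mentioned above.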



3. The Computational Procedure 

The algorithm we propose for the optimal selection of the principal surface and
for the dimension reduction uses the recursive partitioning algorithm proposed
by Friedman (1991) in the optimal knot detection phase.

It can be summarized as follows:

Preliminary phase:

The procedure starts from the computation of the linear principal components
$c_1, \dots, c_p$.

Forward phase: 

Fixing the degree of the spline and the number of knots, (n-2) possible
transformations are computed, one for each observed unit value except the
minimum and the maximum (these being, respectively, the values at which the
left and right external knots are placed). The algorithm,
performed for each component, chooses the knot which minimizes criterion (2).
The first knot partitions the units into two sets. For the choice of the second
internal knot, the algorithm tries each observation internal to the two selected
sets and retains the one minimizing the criterion. Analogously, the
subsequent knots are located until the fixed number is reached.
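
The forward phase can be sketched as a greedy search. The sketch below is only illustrative and rests on simplifying assumptions: a single response y, zero-degree splines (so the fit is piecewise constant), the residual sum of squares as the empirical counterpart of criterion (2), and hypothetical function names and data.

```python
def piecewise_rss(c, y, knots):
    """Residual sum of squares of a piecewise-constant fit of y on c."""
    cuts = sorted(knots)
    groups = {}
    for ci, yi in zip(c, y):
        interval = sum(ci >= k for k in cuts)
        groups.setdefault(interval, []).append(yi)
    rss = 0.0
    for values in groups.values():
        mean = sum(values) / len(values)
        rss += sum((v - mean) ** 2 for v in values)
    return rss


def forward_knots(c, y, n_internal):
    """Greedy selection of internal knots among the observed scores."""
    candidates = sorted(set(c))[1:-1]   # minimum and maximum are excluded
    chosen = []
    for _ in range(n_internal):
        remaining = [k for k in candidates if k not in chosen]
        if not remaining:
            break
        best = min(remaining, key=lambda k: piecewise_rss(c, y, chosen + [k]))
        chosen.append(best)
    return sorted(chosen)


# Hypothetical scores with a jump in the response between 0.0 and 1.0.
c = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.0, -0.1, 5.0, 5.1, 4.9]
knots = forward_knots(c, y, n_internal=1)
```

In this toy case the selected knot is the observed value at which the response jumps, since it is the split that leaves the smallest residual variation within the two sets.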

Backward phase: 

It chooses the space dimension for the construction of the PS. This is a trade-off
problem between the accuracy of the summarization and the risk of overfitting
related to an increasing number of dimensions. The adopted criterion is the
Generalized Cross-Validation (GCV, Craven and Wahba, 1979):

$$GCV(R) = \frac{1}{n} \, \frac{\sum_{j=1}^{p} \| y_j - f(c_j) \|^2}{\left[ 1 - C(R)/n \right]^2} \qquad (4)$$

(for R = p, ..., 1), where $C(R)$, which includes the term $dR + 1$, represents a
suitable increasing cost function of the space dimension. In the regression context,
Friedman (1991) motivates the choice of the constant: 2 < d < 4. Larger values
for d will lead to fewer knots.
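
The backward choice of the dimension can be sketched as follows, assuming that the GCV takes the form used by Friedman (1991), i.e. the mean squared residual divided by $[1 - C(R)/n]^2$. The function names, the residual values and the simplified cost (only the $dR + 1$ part, with d = 3) are hypothetical.

```python
def gcv(rss, n, cost):
    """Generalized cross-validation score (form assumed from Friedman, 1991).

    rss  : residual sum of squares of the fitted surface
    n    : number of observations
    cost : value of the cost function C(R) for the current dimension
    """
    if cost >= n:
        raise ValueError("cost must be smaller than n")
    return (rss / n) / (1.0 - cost / n) ** 2


def choose_dimension(rss_by_dim, n, cost_by_dim):
    """Return the dimension R with the smallest GCV score, and all scores."""
    scores = {r: gcv(rss_by_dim[r], n, cost_by_dim[r]) for r in rss_by_dim}
    return min(scores, key=scores.get), scores


# Hypothetical residuals for R = 1, 2, 3 on n = 20 units, with d = 3.
rss_by_dim = {1: 40.0, 2: 12.0, 3: 11.5}
cost_by_dim = {r: 3 * r + 1 for r in rss_by_dim}   # only the dR + 1 part
best_r, scores = choose_dimension(rss_by_dim, 20, cost_by_dim)
```

The example illustrates the trade-off: the third dimension reduces the residuals only slightly, so its higher cost makes the two-dimensional solution preferable.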

The final phase: 

The optimal PS is computed on the univariate transformations of the single
components chosen by the forward phase and on the basis of the number of
univariate splines (space dimension) suggested by the GCV criterion. The tensor
product among the transformed components defines the multivariate spline.



4. Application 

In this section we compare an example of classical CPCA and PSCA. The aim
is to study how criminality depends on some population and economic
indicators (data source: Il Sole 24 Ore, Italian newspaper, 30-12-1996).

Figure 1: The classic representation

We investigate the dependence structure among six predictors and six
responses. The predictor variables are: income per capita (REDD), bank deposits
(DEPB), population density (DENS), number of people under twenty-nine
registered at the employment office (COLL29) and number of bankrupt
enterprises (FALL). The response variables are: number of murders (murd),
number of car thefts (car), micro-criminality (micr), number of flat thefts (flat),
juvenile criminality (under 18) and number of frauds (trick). These variables are
observed on 20 Italian provincial capitals (Trieste TS, Bologna BO, Aosta AO,
Trento TN, Roma RM, Campobasso CB, Bari BA, Ascoli-Piceno AP, Venezia VE,
Cagliari CA, Firenze FI, Genova GE, Reggio-Calabria RC, Aquila AQ,
Milano MI, Napoli NA, Palermo PA, Perugia PG, Potenza PZ, Torino TO).
From the classical CPCA results, we see that the first two eigenvalues account
for 84.3% of the explained variance. Consistently, the correlation circle (fig. 1)
shows the positive correlation between the response variables under 18 and murd
and the positive direction of the second axis, while the right side of the first
axis is strictly related to the micr and flat variables. We also see that the
predictor variables TRICK, DENS and COLL29 have great importance for the
explanation of under 18 and murd, while DEPB and REDD do for the micr and flat
variables: the higher the bank deposits, the higher the micro-criminality. As a
matter of fact, we observe that big towns like MI, RO and BO are principally
characterized by micr and flat crimes and by DEPB. We also notice the opposition
between NA and AQ with respect to the second axis, which is correlated with the
under 18 and murd variables.

With the PSCA analysis we would like to show that this interpretation does not
allow us to evaluate the different locations of all towns and does not make it
possible to understand clearly which particular variable influences the
proximities (similarities) among towns.

Instead, we will point out the observations on the PS representations.

Representation on Adaptive Principal Surface

PSCA allows a suitable representation of the dependence structure among 
variables. We construct surface graphs for each of the six responses. 

After computing the principal components (tab. 1), to transform them we use 
zero degree B-splines with 3 different knots. The knots sequence per 
component (tab. 2) is defined on the basis of the criterion (2) and defines the 
grids. The number of components retained for the construction of surfaces (this 
solves the space dimension problem) is two as suggested by the GCV criterion. 
The tensor product of the transformed components computes the multivariate 
spline, i.e. the principal surface. 



Table 1: Principal Components



Table 2: Knots sequence

        F1      F2
[1]    -2.4    -1.4
[2]     2.35    1.93
[3]    -0.72   -0.28


Regressing the projected variables Y by the tensor matrix we compute the 
coefficients which are the grids levels. 
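
With zero-degree B-splines this regression has a simple form, sketched below: the tensor-product basis functions are indicators of disjoint grid cells, so the least-squares coefficient of a cell is the mean of the response over the units falling in it. The function name, the use of one internal knot per component and the toy values are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def grid_levels(c1, c2, y, knots1, knots2):
    """Grid levels of a zero-degree principal surface on two components.

    With crisp coding, the tensor-product basis functions are indicators
    of disjoint grid cells, so the least-squares coefficient of a cell is
    the mean of y over the units falling in it.
    """
    k1, k2 = sorted(knots1), sorted(knots2)
    cells = defaultdict(list)
    for a, b, value in zip(c1, c2, y):
        cells[(bisect_right(k1, a), bisect_right(k2, b))].append(value)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}


# Hypothetical scores of five units on two components and one response.
levels = grid_levels(
    c1=[-1.0, -0.9, 0.5, 0.6, 1.2],
    c2=[-0.5, 0.3, -0.4, 0.2, 0.4],
    y=[2.0, 4.0, 1.0, 3.0, 5.0],
    knots1=[-0.72],
    knots2=[-0.28],
)
```

Units sharing a cell share a grid level, which is how groups of units are read off the surface of each response.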

For each response, we construct a PS where towns are located and get the
following groups:

1) PZ, RC, NA, AP 

2) PA, CA, PG, TO, TN, BO, MI, RO, FI, BA 

3) CB 

4) GE, AQ, AO, VE, TS 



Figure 2: under 18 variable 

Furthermore, group 1), for example, is detected mostly by the variable flat
(see fig. 3). In this way, we can evaluate the negative or positive similarities of
towns on the basis of their location on the grids.



5. Conclusions 

In this paper we propose a non-linear generalization of CPCA. The technique,
called PSCA, has been implemented by an algorithm which appears to be a
useful tool for data reduction and for the reconstruction of multivariate
approximation functions. The example shows how the principal surface
representation makes it possible to take into account the interactions among the
structural variables; in particular, the resulting "tri-dimensional" representation
permits an in-depth investigation of the similarities among units due to a
particular variable (classification variable).

An interesting extension of the procedure may concern an a priori non-linear
transformation of both the original variables and their components (Tessitore,
Lombardo, van Rijckevorsel 1998), or the optimal detection of the number of
knots (Lombardo R., D’Ambra L., Tessitore G. 1997), or the robustness of the
criterion.

Abstract: In this paper we propose an extension of Generalised Canonical
Analysis to the study of symbolic objects. The aim is to analyse symbolic
objects on a factorial plane. In the reduced sub-space, the symbolic objects are
represented by polygons instead of points as in classical data analysis. This
kind of representation seems consistent with their original meaning of complex
information. Furthermore, we propose a symbolic interpretation of the factorial
axes, and an evaluation of the quality of the images of the symbolic objects on
the factorial plane.

Key words: Symbolic objects; Generalised Canonical Analysis; fuzzy coding.



1. Introduction 

Most real phenomena can be interpreted starting from the
relations among the variables by which they are characterized. Symbolic
analysis was introduced by Diday (1989, 1991) in order to study complex
data, which are often more representative of real phenomena than classical data
(structured as individuals × variables).

In recent years, the development of symbolic data analysis has produced
many methods for the synthesis and the representation of complex information.
The techniques proposed for the study of symbolic objects are both
symbolic and numerical. The symbolic techniques are typical of the Artificial
Intelligence field and are based on a matching between symbolic objects of
several orders. This kind of strategy makes it possible to carry out generalization
and specialization of classes, here represented by symbolic objects. On the other
hand, most numerical techniques are derived from classical statistical
methods.

Several Data Analysis methods have been suitably extended to the study of
this particular kind of data.

Nevertheless, the application of numerical techniques to symbolic objects
requires a numerical transformation of the variables that describe the objects into
classical data matrices. The complex information expressed by the symbolic
objects is recovered by a symbolic interpretation of the results.



The paper has been supported by Esprit Project n. 20821 grant: Symbolic Official Data Analysis 
System. The author is grateful to prof. Carlo Lauro for helpful comments. 

The generalization of factorial methods to symbolic objects is one of the
research fields in the SODAS project; in this context, works by Gettler-Summa
(1992); Chouakria et al. (1995, 1996); Lauro et al. (1997) should be mentioned.
In the context of factorial approaches to symbolic objects, the present work aims
at providing a particular visualization of symbolic objects on factorial planes and
a symbolic interpretation of the factorial axes.

The main features of this work concern: the definition of symbolic objects;
the numerical transformation of the descriptors into fuzzy and binary coding
matrices; the extension of Generalised Canonical Analysis to fuzzy coded data;
the graphical representation of the symbolic objects on a factorial plane; the
symbolic interpretation of the factorial axes; an application to real data.



2. Symbolic object definition

Let $\Omega$ be a set of individuals w, called elementary objects, and
$Y = \{y_1, y_2, \dots, y_p\}$ be a set of numerical and nominal descriptors, with
domains $O_j$.

The most elementary symbolic object, called event, is denoted by
$e_j = [y_j \in V_j]$, where $V_j \subseteq O_j$ is the set of values of $y_j$
describing the symbolic object $e_j$. A conjunction of events defines a symbolic
assertion object:

$$a = \wedge_j e_j = \wedge_{j=1,\dots,p} [y_j \in V_j]$$

A logical function defined on the set $\Omega$ is associated to a, so that
a(w) = true if $y_j(w) \in V_j$, $\forall j$. The extension of a, denoted
$ext(a/\Omega)$, is the set of elements $w \in \Omega$ satisfying the condition
a(w) = true.
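
An assertion and its extension can be sketched in a few lines. The representation chosen here (a dictionary of events, sets for nominal descriptors, pairs for intervals) and all names and values are hypothetical, introduced only to illustrate the definitions above.

```python
def make_assertion(events):
    """Build a(w) from events {descriptor: admissible values}.

    A set lists admissible categories; a (low, high) tuple is a closed
    interval of admissible numerical values.
    """
    def a(w):
        for name, admissible in events.items():
            value = w[name]
            if isinstance(admissible, tuple):
                low, high = admissible
                if not low <= value <= high:
                    return False
            elif value not in admissible:
                return False
        return True
    return a


def extension(a, omega):
    """The elementary objects w of omega for which a(w) is true."""
    return [w for w in omega if a(w)]


# Hypothetical elementary objects and descriptors.
omega = [
    {"name": "w1", "area": "North", "income": 32.0},
    {"name": "w2", "area": "South", "income": 18.0},
    {"name": "w3", "area": "North", "income": 21.0},
]
a = make_assertion({"area": {"North"}, "income": (25.0, 40.0)})
members = [w["name"] for w in extension(a, omega)]
```

Only the first object satisfies both events, so it is the only member of the extension of this assertion.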

Each assertion is described by multi-nominal, ordinal and interval-valued
continuous variables. Such variables are related to one another.

Further information, given by probabilities, occurrences or beliefs, can be
associated to the categories of nominal variables. Descriptors of this kind are
called modal, and the mode is the probability, the occurrence or the belief.

Moreover, the space of the symbolic object descriptors can be reduced by
logical constraints on the variables. Logical rules can induce the non-applicability
(NA) of a variable, or of some categories, if another variable takes some values.
Constraints can also be expressed by taxonomic structures on the categories of
nominal variables.



3. Numerical coding of the symbolic object descriptors 

Factorial techniques have been proposed in order to analyse the relationships 
among the symbolic objects in reduced dimension sub-spaces. 

The application of a numerical approach to symbolic objects requires their
transformation into classical data and a subsequent reconstruction of the global
information expressed by the objects, interpreting the results of the analysis in a
symbolic way. The first phase consists of a suitable coding of the p variables,
according to their nature, in classical matrices Zj (j=1,...,p). In particular, the
nominal variables are coded into binary matrices (0/1). When an object presents
several categories, they are coded into several rows of the coding table Zj. If
frequencies or probabilities are associated to the categories of a nominal
descriptor, these values constitute the coding values of the descriptor.

We suggest a fuzzy coding of the quantitative descriptors (discrete and
interval) in order to retain the numerical information after the categorization of
the numerical variables, and to take into account the variability of interval
variables. In this context, we propose piece-wise polynomial functions, the
B-splines (basis splines), for a fuzzy coding of numerical variables. The most
used coding functions, B-splines of degree 1, or semi-linear functions, identify
three categories of each numerical character (e.g.: low-medium-high), and assign
each value that can be taken by an object to these categories, with values in [0,1].
According to this kind of coding, the two bound values of a continuous variable
are coded into two different rows.
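
A fuzzy coding by semi-linear functions can be sketched as follows. The placement of the three knots (a low, a middle and a high value chosen by the analyst), the clamping of values outside the range, and the function names are assumptions; the sketch only illustrates the low-medium-high coding described above.

```python
def fuzzy_code(value, low, mid, high):
    """Code a value into (low, medium, high) memberships in [0, 1].

    The three functions are piecewise linear with knots low < mid < high
    and sum to one; values outside [low, high] are clamped to the bounds.
    """
    x = min(max(value, low), high)
    if x <= mid:
        t = (x - low) / (mid - low)
        return (1.0 - t, t, 0.0)
    t = (x - mid) / (high - mid)
    return (0.0, 1.0 - t, t)


def code_interval(lower, upper, low, mid, high):
    """An interval-valued descriptor yields two rows, one per bound."""
    return [fuzzy_code(lower, low, mid, high), fuzzy_code(upper, low, mid, high)]


# Hypothetical descriptor observed on [12, 18], with knots 10, 20 and 30.
rows = code_interval(12.0, 18.0, low=10.0, mid=20.0, high=30.0)
```

Here the lower bound is coded mostly as "low" and the upper bound mostly as "medium", so the two rows retain the width of the interval after categorization.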

In order to take into account the original relationships among the variables, 
we propose to relate the rows of the coding matrices, corresponding to each 
symbolic object, by means of their Cartesian product. 

In this way, some rows of the coding matrices Zj (j=1,...,p) are duplicated, and
their increase depends on the number of symbolic objects, the number of
continuous interval variables and the number of categories of the nominal
descriptors. Therefore, the total number of rows of the coding matrices Zj
(j=1,...,p) is equal to

$$N = S \times 2^q \times \prod_{j=1}^{h} k_j$$

where S is the number of symbolic objects considered in the analysis, q is the
number of interval variables, and $k_j$ is the number of categories of the j-th
multi-nominal variable (j=1,...,h). With

$$K = 3 \times q + \sum_{j=1}^{H} k_j$$

we indicate the total number of categories of the coded descriptors, which is
given by the sum of the three fuzzy coding categories of each quantitative
variable and the $k_j$ categories of the multi-nominal variables as well as of the
modal variables (j=1,...,H, with H>h).

From the juxtaposition of the binary and fuzzy coding tables,
$[Z_1 | \dots | Z_j | \dots | Z_p]$, we obtain the global coding table Z, of
dimension (N×K).
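
The Cartesian product relating the rows of one object, and the resulting row count, can be sketched as follows. The function names and the toy object are hypothetical, and the count assumes that every object presents all the categories of each multi-nominal variable.

```python
from itertools import product
from math import prod


def object_vertices(interval_bounds, nominal_categories):
    """Combinations of descriptor values describing one symbolic object.

    interval_bounds    : list of (lower, upper) pairs, one per interval variable
    nominal_categories : list of category lists, one per multi-nominal variable
    """
    choices = [list(bounds) for bounds in interval_bounds]
    choices += [list(cats) for cats in nominal_categories]
    return list(product(*choices))


def total_rows(n_objects, q, categories_per_nominal):
    """N = S * 2^q * product of the k_j (assumption stated in the lead-in)."""
    return n_objects * 2 ** q * prod(categories_per_nominal)


# Hypothetical object: two interval variables and one nominal variable
# for which the object presents two categories.
vertices = object_vertices([(10, 20), (0.1, 0.4)], [["a", "b"]])
```

With interval variables only, each object contributes $2^q$ rows, one per combination of lower and upper bounds.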

Furthermore, in the presence of dependence rules, the constrained symbolic
objects can be reduced to a set of sub-objects, which are described by the values
of the original variables consistent with the rules (Verde, 1997). Thus, the
symbolic sub-objects can be interpreted as specialized symbolic objects.

They are treated in the analysis in the same way as the other symbolic objects.
In the coding phase, the category NA (non-applicable) is added to the other
coding categories of the variables that are made unfeasible by logical
conditions.



4. Generalised Canonical Analysis on symbolic objects

From a geometrical point of view, the rows of the coding table Z, relative to
each symbolic object, correspond to the vertices of a hypercube in the space
of the categories of the coded variables.

A factorial approach based on Generalised Canonical Analysis (GCA) is
performed in order to visualize the relations among the hypercubes in a
sub-space of reduced dimensions. In particular, we consider an extension of the
classical GCA to the binary and fuzzy coded data matrices.

The proposed factorial analysis aims at decomposing the total inertia of the
symbolic objects, i.e. the total inertia of the relative hypercube vertices, on
factorial axes. Thus, as is known, it searches for the axes of synthesis $v_j$ in
each sub-space $E_j$ spanned by the column vectors of the coding matrices Zj
(j=1,...,p), and for the orthogonal vectors $\xi_\alpha$ as a global synthesis of
all $v_j$. The criterion optimized is the average multiple correlation ratio:

$$\frac{1}{p} \sum_{j=1}^{p} \rho^2(\xi_\alpha, v_j)$$

where $v_j$ and $\xi_\alpha$ are normalised to 1.

Let $P_j = Z_j (Z_j' Z_j)^{-1} Z_j'$ be the projection operator associated to the
sub-space $E_j$ and $P = \sum_j P_j$. The maximum inertia axes are obtained
maximizing:

$$\frac{1}{N p} \sum_{j=1}^{p} \xi_\alpha' P_j \xi_\alpha = \frac{1}{N p} \, \xi_\alpha' P \, \xi_\alpha$$

which is equivalent to solving the following characteristic equation, under the
usual orthonormality constraints $\xi_\alpha' \xi_\alpha = 1$ and
$\xi_\alpha' \xi_{\alpha'} = 0$ for $\alpha \neq \alpha'$:

$$\frac{1}{N p} Z \Sigma^{-1} Z' \xi_\alpha = \lambda_\alpha \xi_\alpha, \qquad \alpha = 1, \dots, M; \; M = \min[N-1, K-1] \qquad (2)$$

where $\Sigma^{-1}$ is a block diagonal matrix of elements $(Z_j' Z_j)^{-1}$
(j=1,...,p).

All vertices $v_s$ (s=1,...,N) of all hypercubes are projected on the factorial
plane, according to the classical formula of the coordinates of the individuals on
the factorial axes:

$$coord_\alpha(v_s) = z_s' \Sigma^{-1} Z' \xi_\alpha \qquad (3)$$
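
The eigen-analysis behind the factorial axes can be sketched numerically. The sketch below is illustrative only: it diagonalizes the average of the projection operators $P_j$, ignores the scaling by N and any centring, and uses hypothetical coding tables and function names; the pseudo-inverse is an added safeguard, not part of the method as described.

```python
import numpy as np


def gca_axes(blocks):
    """Eigen-decomposition of the averaged projectors (1/p) * sum_j P_j.

    blocks : list of coding matrices Z_j, each with N rows.
    Returns the eigenvalues in decreasing order and the matching
    unit-norm eigenvectors (as columns).
    """
    n_rows = blocks[0].shape[0]
    p_sum = np.zeros((n_rows, n_rows))
    for z in blocks:
        # The pseudo-inverse guards against a singular Z_j' Z_j.
        p_sum += z @ np.linalg.pinv(z.T @ z) @ z.T
    eigenvalues, eigenvectors = np.linalg.eigh(p_sum / len(blocks))
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]


# Hypothetical coding tables for N = 4 rows: one binary, one fuzzy.
z1 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
z2 = np.array([[0.8, 0.2, 0.0], [0.0, 0.5, 0.5],
               [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]])
values, vectors = gca_axes([z1, z2])
```

Since the rows of each binary or fuzzy coding table sum to one, the constant vector belongs to every sub-space and yields a trivial eigenvalue equal to 1 in this uncentred sketch; the informative axes are the following ones.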



5. Visualization of the symbolic objects on the factorial plane

The analysis of symbolic objects by means of an extension of a classical factorial
method requires an interpretation of the results in symbolic terms. Therefore, in
order to display the objects consistently with their theoretical definition as global
and unitary information, the projected vertices of each hypercube should be
collected in a single geometrical figure. Different kinds of geometrical images
can be proposed. The choice of the particular figure depends on two main
considerations: an easy comparability among objects and the best recognition of
the original symbolic objects from their deformed projections on the factorial
plane. The first consideration leads to visualizing a symbolic object by means of
the maximum covering rectangle of the projected hypercube vertices, according
to the representation already proposed in other factorial approaches to these
complex data (Chouakria et al., 1995; 1996).

The length of the sides of the rectangles representing the symbolic objects on
factorial planes is proportional to the variability of the descriptors that have
contributed most to the orientation of the factorial axes.

Nevertheless, this kind of representation is not always suitable for recognizing
the original symbolic objects. In fact, the enveloping rectangle takes into account
merely the vertices with minimum and maximum factorial coordinates, and it
does not consider the spread of all the vertices on the plane. Therefore, the
rectangles furnish over-sized representations of the hypercubes with respect to
the real surface covered by the point-vertices. The first consequence of this way
of visualizing is that objects of different original dimensions may be represented
by rectangles of the same dimensions.

An alternative, more suitable visualization of the symbolic objects seems to be a
convex envelope of the external vertices of each hypercube, which makes it
possible to consider the spread of the relative vertices.
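
The two representations can be compared on a small example. The sketch below computes the maximum covering rectangle and a convex envelope (here obtained with Andrew's monotone chain algorithm, a choice made for this illustration) of a set of projected vertices; the coordinates and function names are hypothetical.

```python
def covering_rectangle(points):
    """Maximum covering rectangle: (min x, min y, max x, max y)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def convex_envelope(points):
    """Convex hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        hull = []
        for p in sequence:
            while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull[:-1]

    return half_hull(pts) + half_hull(reversed(pts))


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical projected vertices of one hypercube on the factorial plane.
projected = [(0.0, 0.0), (2.0, 1.0), (4.0, 0.0), (2.0, 3.0), (2.0, 1.5)]
rectangle = covering_rectangle(projected)
envelope = convex_envelope(projected)
```

In this toy case the rectangle has twice the area of the convex envelope, which illustrates the over-sizing effect of the rectangular representation.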

The main advantage of this kind of representation is that it allows a better
recognition of different symbolic objects. Furthermore, it furnishes an
interpretation of the symbolic objects that considers the variability of the vertices
of the hypercubes along the directions of the factorial axes.

When the symbolic objects are visualized by rectangles, we propose to
evaluate the quality of the projected hypercubes on the factorial plane as the
average of the quality of the single segments linking the vertices that generate
the rectangle. On the other hand, when the symbolic objects are represented by
convex envelopes, the quality of the images of the objects can be evaluated by the
quality of the internal diagonals of the hypercubes.

Moreover, in this context, an interesting interpretation of the factorial axes in
symbolic terms can be given by symbolic assertions associated to the axes. The
descriptors of the axes are the categories of the original p variables that have
contributed most to the determination of the factorial axes (see e.g. section 6).



6. A numerical example 

An application of the proposed approach is performed on a set of data about
the quality of life in Italian cities (source: Il Sole 24 Ore, December 18, 1995).
From these data, we have obtained 9 symbolic objects which are representative
of the cities of different dimensions (L-“large”, M-“medium”, S-“small”) in the
North, the Center and the South (including the Islands) of Italy (areas indicated
by N, C and S). The cities are characterized by 5 economic-demographic
indicators, considered as interval variables - I: income per capita; U:
unemployment rate; M: micro-criminality; T: road traffic; C: cinema and theater
expenditure per capita. The coding of each descriptor is realized by means of
three B-splines of degree 1, corresponding to semi-linear functions. The coding
functions identify the categories low-medium-high of each indicator.

Each object is coded into two rows of the coding table associated to each
descriptor. The number of vertices of the hypercube associated to each object is
equal to $2^5$, which corresponds to the number of rows of the coding table Z
related to each object. The total number of rows of Z is equal to
$N = 8 \times 2^5$ and the number of columns is 3×5 = 15, the descriptors having
been coded with respect to 3 categories.

In Figure 1 the eight hypercubes are visualized as maximum covering
rectangles. The coordinates of the categories of the variables are also represented
on the factorial plane. We observe that the first axis opposes the cities of the
South to the Northern ones, while the second axis opposes the large cities to the
small ones. The dimensions of the rectangles representing the
symbolic objects on the factorial plane are proportional to the variability of their
descriptors.




A suitable interpretation of the factorial plane (explained inertia = 62%) is given
by a description of the axes in terms of symbolic assertions, where the values of
the descriptors are chosen on the basis of the contributions of the categories to
the orientation of the factors. Therefore, the symbolic assertions associated to
axes 1 and 2 (positive “+” and negative “-” side) are the following:

a1+ = [income per capita = {low}] ∧ [unemployment rate = {high}]
a1- = [micro-criminality = {high}] ∧ [cinema/theater expenditure p.c. = {high}]
a2+ = [micro-criminality = {low}] ∧ [road traffic = {medium}]
a2- = [road traffic = {high}] ∧ [micro-criminality = {high}]
According to this interpretation, the small cities of South Italy (SS) are
strongly characterized by low income per capita and a high unemployment rate
(descriptors of $a_1^+$); they are opposed to the large cities of the North (NL),
where the expenditure p.c. for cinema and theater is higher (descriptors of
$a_1^-$). The small cities of the North (NS) are well described by low
micro-criminality and low road traffic (descriptors of $a_2^+$), while the large
cities of both the North (NL) and the South (SL) are characterized by high road
traffic and high micro-criminality (descriptors of $a_2^-$).

It is worth noting that the rectangles representing the cities of the Center area
(CS, CM and CL) are located closer to the origin of the axes than the images of
the other cities. This means that such cities assume values of their descriptors
near to the average of the values taken by the other objects. A particular case
is given by the image of the large cities of the Center, represented by a
rectangle completely included in the image of the large cities of the North area
(NL) and presenting, at the same time, a small overlap with the image of the
medium cities of the North (NM). This means that the large cities of the Center
(CL) are characterized by the same values of the variables describing the large
cities of North Italy (NL), even if the latter have a higher variability of the
descriptors than the Center cities.

Figure 2: Visualization of the symbolic objects by convex envelopes (axes: Axis 1, Axis 2)



An alternative description of the symbolic objects is given by convex
envelopes. Figure 2 shows a section of the factorial plane in which only the
cities of the South area are represented. In particular, we observe that the
images of the large and the medium cities overlap when rectangular
representations are used, while they are distinct when convex envelopes are
used. The large overlap between the images of the medium and small cities is
similarly reduced by visualizing the objects with convex envelopes. Thus, this
representation allows a better discrimination among objects than the
rectangular one.
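The reason why convex envelopes discriminate better than covering rectangles can be seen in a small sketch: the convex hull of the projected vertices never has a larger area than their bounding rectangle. The projected coordinates below are hypothetical, and the hull is computed with the standard monotone-chain algorithm, which is an illustrative choice and not necessarily the procedure used by the authors.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; returns the hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def bounding_rectangle_area(points):
    """Area of the maximum covering rectangle with sides parallel to the axes."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical images of the vertices of one symbolic object on the factorial plane.
projected = [(0.0, 0.0), (2.0, 1.0), (4.0, 4.0), (1.0, 2.0), (3.0, 3.5)]
hull = convex_hull(projected)
hull_area = polygon_area(hull)
rect_area = bounding_rectangle_area(projected)
```

For these points the hull is much smaller than the rectangle, which leaves less room for spurious overlap between objects.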
7. Conclusion

In conclusion, classical factorial approaches seem to be suitable techniques to
visualize the relationships among the objects on reduced sub-spaces. They allow
one to take into account the variability of the descriptors, which is
proportional to the dimensions of the images of the symbolic objects, and they
furnish an interpretation of the objects according to the symbolic meaning
assumed by the factorial axes.

Furthermore, the visualization of the objects by convex envelopes permits a
better discrimination of different symbolic objects on the reduced sub-spaces.
Alternative suitable ways to visualize symbolic objects, such as ellipses or
principal diagonals, can be proposed in order to recognize the symbolic objects
with respect to their original dimensions.

Another aspect to be considered concerns the global interpretation of the
sub-objects into which the symbolic objects are decomposed by logical rules. In
fact, on the factorial plane, these sub-objects are visualized independently of
their original membership of a unique object, whether a rectangular or a convex
envelope visualization is used.
1. Structural Model with Qualitative Variables

The structural model for the study of causal relationships among latent
variables is composed of one structural and two measurement equations:

$H = HB + \Xi\Gamma + E = \Xi\Gamma(I-B)^{-1} + E(I-B)^{-1}$; $Y = H\Lambda_y + \Delta$; $X = \Xi\Lambda_x + U$ (1)

where $Y' = (y'(1), \dots, y'(t))$ $(n_y, t)$, $X' = (x'(1), \dots, x'(t))$ $(n_x, t)$
are the observed mixed variables; $\Xi' = (\xi'(1), \dots, \xi'(t))$ $(n_\xi, t)$,
$H' = (\eta'(1), \dots, \eta'(t))$ $(n_\eta, t)$ are the latent variables;
$E' = (\varepsilon'(1), \dots, \varepsilon'(t))$ $(n_\varepsilon, t)$ are the errors
in equations; $\Delta' = (\delta'(1), \dots, \delta'(t))$ $(n_\delta, t)$,
$U' = (u'(1), \dots, u'(t))$ $(n_u, t)$ are the errors in variables. It is assumed
that: all the random variables have zero mean and finite variance, B is a lower
triangular matrix with zeros on the main diagonal, (Y, X, H) are identically
distributed and ($\Xi$, E, $\Delta$, U) are identically and independently
distributed. The model is usually proposed with restrictions on parameters and
on covariances.

The solutions are reached starting from the variance-covariance matrix of the
reduced model, where the variables H in the measurement models are substituted
by the value $\Xi\Gamma(I-B)^{-1} + E(I-B)^{-1}$ obtained from the structural
equations.
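The reduced form of the structural equation follows in two steps. The sketch below assumes, as the text appears to state, that B is lower triangular with zeros on the main diagonal, so that $I - B$ has unit diagonal and is invertible.

```latex
\begin{aligned}
H &= HB + \Xi\Gamma + E \\
H(I - B) &= \Xi\Gamma + E \\
H &= \Xi\Gamma(I - B)^{-1} + E(I - B)^{-1},
\qquad \det(I - B) = 1 .
\end{aligned}
```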
2. Methods for Obtaining Solutions

First of all, given that a continuous bivariate normal variable underlies every
pair of ordinal observed variables $(J_l, J_m)$, the polychoric correlation
$\rho_{(J_l,J_m)}$ between its two components is calculated. When only one
variable is ordinal, the polyserial correlation coefficient between an observed
quantitative variable and the normal underlying variable is defined. Given the
observed frequencies of the qualitative variables, the polychoric coefficients
are reached in different ways. Jöreskog (1994) hypothesizes that the marginal
probabilities of the normal variables are equal to the marginal frequencies of
the two-way table of the ordinal variables. Therefore he first obtains the
thresholds, using the marginal distributions of the bivariate normal
distribution; then, for given thresholds, he obtains the correlation
coefficient by maximizing the log likelihood of the sample with respect to
$\rho_{(J_l,J_m)}$. Lee et al. (1990) estimate the thresholds by means of
Partition Maximum Likelihood (PML), which is simpler from the computational
point of view.
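The first step of the two-step approach, obtaining thresholds from marginal frequencies, can be sketched as follows. The function name and the counts are hypothetical; the sketch only assumes a standard normal underlying variable and does not cover the subsequent maximization over the correlation.

```python
from statistics import NormalDist


def thresholds_from_counts(counts):
    """Thresholds of the underlying standard normal variable, obtained by
    matching cumulative marginal frequencies to normal probabilities.

    Categories with a cumulative proportion of exactly 0 or 1 are not handled.
    """
    total = sum(counts)
    cumulative = 0
    thresholds = []
    for c in counts[:-1]:  # the last category has no finite upper threshold
        cumulative += c
        thresholds.append(NormalDist().inv_cdf(cumulative / total))
    return thresholds


row_margin = [20, 50, 30]  # hypothetical marginal counts of a 3-category variable
tau = thresholds_from_counts(row_margin)
```

The two thresholds in `tau` cut the standard normal into areas of 0.2, 0.5 and 0.3.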

Lee et al. (1995) simultaneously estimate the correlation coefficients and the
thresholds for pairs of variables, maximizing the log likelihood of the sample
with respect to $\rho_{(J_l,J_m)}$ and with respect to every threshold, by means
of PML (but Jöreskog (1994) observes that “different estimates of thresholds for
one variable may be obtained from different pairs of variables”).

By means of Full Maximum Likelihood, in one case Lee et al. (1992) 
simultaneously reach all the thresholds of the polychoric correlations; in 
another case (Lee et al. (1990)), they reach also the parameters, the variances 
and the covariances of the latent variables. Moreover these two last methods 
take up too much computer time (Lee et al. (1990) Lee et al. (1995)). 

In the first three methods, the parameters and the covariances of the latent
variables and errors are obtained from the polychoric or polyserial correlations.
There is no agreement about the method of obtaining the parameters and the
latent variables. Jöreskog (1990), (1994) and Rigdon and Ferguson (1991) propose
the Weighted Least Squares method and criticize the Maximum Likelihood method
because the standard errors of the parameter estimates are asymptotically
incorrect, but Lee et al. (1995) continue to prefer the Generalized Least
Squares method and criticize the proposal of Jöreskog because it requires sample
sizes larger than 200 and more computer time.
3. Some Critical Observations

Comparing, in some Monte Carlo studies, different correlation coefficients when
the underlying bivariate distribution is normal, the polychoric correlation is
shown to be “the best in the sense of being closest to the true correlation”
(Quiroga (1992)) [even if in a Monte Carlo study Babakus et al. (1987) show that
the polychoric coefficient provides the best estimates of model parameters and
the worst fit statistics]. However, Muthén (1984), Aish and Jöreskog (1990) and
Quiroga (1992) also say that “the assumption of underlying bivariate normality
is too strong for most ordinary variables used in social sciences”. There are
not many studies on the use of the polychoric coefficient with a non-normal
underlying distribution. Lee and Lam (1988) study the robustness of the
polychoric coefficient only when the underlying distribution is elliptical
(containing multivariate normal, platicentric and leptocentric distributions).
Lee et al. (1995), even though obtaining quite satisfactory results about such
robustness with moderate size random samples, say that “to draw a non definite
conclusion, a longer simulation is needed”. Moreover, in the literature only
Quiroga (1992) tries, in a systematic way, to extend the distributive hypothesis
of the continuous variables underlying the qualitative variables. First of all,
she says that when the normality assumptions are dropped, such variables are
surely not consistent and slightly biased (Quiroga (1992)). Moreover, in a
Monte Carlo study she verifies that, when the underlying distribution is a
bivariate skew-normal, the polychoric coefficient is the best choice only for
sample sizes of 200-400 and a large number of categories (5-9), and that it
underestimates the true correlation coefficient. Then, by means of a measure of
non-normality, she verifies that when the underlying distributions are generated
by the Fleishman-Vale-Maurelli polynomial transformation (with departure from
normality due to skewness and kurtosis) the polychoric coefficient is robust,
but overestimates the true correlation. Finally she proposes an extended
polychoric coefficient with distribution given by a mixture of a normal and a
univariate skew-normal density function, but she does not give empirical
verifications. Therefore, until now, there are neither theoretical
demonstrations nor empirical simulations which give satisfactory and generally
valid reasons for using the polychoric coefficient with a non-normal underlying
distribution.

Moreover, the solutions based on polychoric coefficients: are not sensitive to
the different scales of the qualitative variables, because they generally deal
only with ordinal variables; reach solutions from the variance-covariance matrix
of the reduced model that differ from the solutions obtained from the observed
variables of the original measurement models; and have the same problems of
non-identification of parameters and indeterminacy of latent variables as
structural models with quantitative variables (Vittadini 1989).
4. An Alternative Proposal

In the quantitative case the problems of non-uniqueness of the solutions are
resolved by using linear combinations of the observed variables instead of
causal latent variables (Wold (1982), Haagen and Vittadini (1991)). In this
paper we propose to obtain the latent variables of the model as linear
combinations of the observed mixed variables, simultaneously quantified by
means of multidimensional scaling methods that combine optimal scaling and
ordinary least squares. Such methods resolve the problem of non-normality of
the variables because they are distribution free (Young (1981)) and give unique
solutions once the method of multidimensional scaling has been chosen. In order
to avoid subjective choices about multidimensional scaling methodologies (and
therefore subjective solutions), we propose: to quantify the qualitative
variables and to obtain their linear combinations by means of a unique
objective function; to reach solutions that are flexible with regard to the
different kinds of linear combinations required by the problems; to take into
account the different scales of the qualitative variables. Therefore, among the
variety of multidimensional scaling methods we choose the family of ALSOS
methods (Young (1981), Keller and Wansbeek (1983)). These methods alternate
optimal scaling, which quantifies the qualitative variables, and ordinary least
squares, which obtains linear combinations of them, in an iterative way. They
thus obtain solutions that depend on the scale of the variables
(ordinal-nominal, continuous-discrete) and on the aim of the analysis (e.g.
principal components, canonical correlation), answering the previous problems.
Within the ALSOS family we avoid methods such as OSMOD (Saito and Otsu (1988))
or INDOMIX-CAMIX (Kiers (1991)), which obtain solutions in two stages. Instead
we choose methods that simultaneously obtain the quantifications of the
qualitative variables and their linear combinations: ADDALS (De Leeuw, Young,
Takane (1976)), MORALS-CORALS (Young, De Leeuw, Takane (1976)), PRINCALS (De
Leeuw and Van Rijckevorsel (1980)) and OVERALS (Van Der Burg and De Leeuw
(1988)), respectively from the perspective of variance analysis, canonical
correlation, principal components and multiple correspondence analysis.
Moreover, in order to obtain the latent variables from their real indicators as
in Wold (1982), we apply the chosen ALSOS methods to the subsets of mixed
variables $Y_p$, $X_g$ characterized by submatrices with coefficients all
different from zero. So we simultaneously obtain the quantifications $Y^*_p$,
$X^*_g$ of such mixed variables $Y_p$, $X_g$ and their linear transformations
$\eta_p$, $\xi_g$, according to the aim of the analysis (e.g. canonical
correlation, principal components analysis etc.). Then, in order to take into
account the restrictions:
$\mathrm{cov}(\eta_p, \eta_{p'}) = 0$; $\mathrm{cov}(\xi_g, \xi_{g'}) = 0$; $\mathrm{cov}(\Delta_p, \Delta_{p'}) = 0$; $\mathrm{cov}(U_g, U_{g'}) = 0$;

\p,fi) ~^>y (6,ip) ~ ^y(fi.a) ~ ^x(S.v) ~ ( 2 ) 



we apply the Restricted Regression Component Decomposition (RRCD) to the
quantified variables $Y^*$, $X^*$ (Haagen and Vittadini (1998)) by means of an
iterative process:



Jifl = v°p(^l= Qn. ip * n, = nl ;* y« = Qh,„ Va ) 

( ^1= (r ^ S);* % = ^ 

f = ~nj ^ (3) 



^J = 



where are the H° without rf^, rfp, is the complement orthogonal to 

the orthogonal projector on the space generated by and the other symbols 
are defined in a similar way. So we have the following RRCD of yp^ ,xg^ , rfp : 



Qs!‘Kjyi/ifyp, -^ 1 ^‘yp, 

'^°P + P!°P + Qh ^°P 



5. Numerical Example 

The following variables are observed on a sample of 150 families randomly
chosen from the 4103 American families codified in the Federal Reserve Board
research regarding National Income and Wealth of 1983.

$Y_1$: Job contract household ($y_{11}$), spouse ($y_{12}$); Occupation kind
household ($y_{13}$), spouse ($y_{14}$); Occupation sector household ($y_{15}$),
spouse ($y_{16}$).
$Y_2$: Total health ($y_{21}$); Income ($y_{22}$); Debt ($y_{23}$).
$X_1$: Age household ($x_{11}$), spouse ($x_{12}$); Sex household ($x_{13}$),
spouse ($x_{14}$); Number of children ($x_{15}$); Race ($x_{16}$); Residence
region ($x_{17}$); Civil status ($x_{18}$).
$X_2$: Educational level household ($x_{21}$), spouse ($x_{22}$); Full time job
years household ($x_{23}$), spouse ($x_{24}$); Part time job years household
($x_{25}$), spouse ($x_{26}$).
Latent variables: Labour force ($\eta_1$); Health and income ($\eta_2$); Civil
status ($\xi_1$); Instruction grade ($\xi_2$).

In order to verify the causal dependence of the latent variables H on the latent
variables $\Xi$ we use the alternative proposal shown in paragraph 4, with zero
restrictions on four coefficients of the model and with
$\mathrm{cov}(\Delta_1, \Delta_2) = 0$, $\mathrm{cov}(U_1, U_2) = 0$. The
qualitative variables are quantified and the latent variables are obtained as
principal components with the PRINCALS method; the restrictions are then taken
into account by means of RRCD. The variance-covariance matrix of the observed
variables and the results are shown in Table 1.

Table 1 : The alternative proposal for the sample of American families. 
With the example we can verify that the alternative proposal respects all the
properties of the structural model described in paragraph 1 and the restrictions
indicated in this paragraph. Moreover, the alternative proposal obtains unique
solutions, solving all the problems of non-identification of parameters and
indeterminacy of latent variables of the structural models with qualitative
variables.
Abstract: In this paper we describe an exploratory technique based on the
diagonalization of cross-variogram matrices. Our aim is to describe the 
behavior of a multivariate set of spatial data in a dimensionally reduced space 
in such a way that the information on the spatial variation is preserved. 
Furthermore we propose a definition for the range of «variograms» in the 
multivariate case. Simulation studies and an application to botanical data 
collected on line transects are reported. 

Keywords: Cross-variogram, Principal component analysis, singular value 
decomposition, spatial data. 



1. Introduction 

In the analysis of multivariate spatial data several difficulties arise. Because of 
the nature itself of the phenomena under study, we cannot assume independence 
between observations; instead we often want to model the specific type of 
dependence we observe. We usually model these data-sets as generated from a 
multivariate spatial random field (MSRF) under given assumptions on the type 
of relationship existing between locations and variables in each location. 
However, before any serious attempt at model building can be made, we have to
explore the available data-set as deeply as possible. This is very difficult
because of the high dimensionality of the problem. Standard multivariate 
exploratory techniques like principal components analysis (PCA) and 
multidimensional scaling, usually fail to give us a good representation of this 
kind of multivariate data-set as they do not allow the information on the spatial 
arrangement of observations to be included. 

The literature on the treatment of multivariate spatial data mostly deals with
geo-statistical techniques (Borgman and Frahme, 1976; Journel and Huijbregts,
1978; Wackernagel, 1995; Christakos, 1992). In Davis and Greenes (1983) PCA is
applied to the sample correlation matrix in order to produce co-kriging results
by computing a series of univariate variograms on a "weight" matrix whose rows
are uncorrelated; their purpose is to provide an orthogonal basis for the
variables, so that orthogonality holds spatially as well as in each considered
location. In Di Bella and Jona Lasinio (1996 and references therein) some
ordination methods in which the spatial information is included are described,
most of them oriented to pattern recognition in botanical studies, where great
attention is given to reconsidering a posteriori the dimension of the sample
units through the detection of the real scale of the data.

In Tailiang Xie and Myers (1995a, 1995b) a very interesting proposal is
developed. The authors illustrate a technique to determine an orthonormal
matrix which simultaneously diagonalizes, at selected lags, the cross-variogram
matrices (or nearly diagonalizes them, in the sense that the sum of squares of
the off-diagonal elements is small compared to the sum of squares of the
diagonal elements). In this paper, following the same line of thought, we
describe an exploratory technique, based on the spectral decomposition of
cross-variogram matrices, that allows us to represent our multivariate set of
data in a more parsimonious manner while keeping all the available spatial
information throughout the analysis. As a tool for exploring the global
behavior of the spatial variability, we propose a multivariate extension of the
idea of the variogram range.

The choice of cross-variogram matrices is motivated by the fact that cross- 
variograms are symmetrical and non-negative definite matrices that can always 
be defined even when the usual assumptions of second order stationarity cannot 
be made. 

2. Principal Components Analysis of MSRF 

Formally our multivariate data-set is thought of as generated by a k-dimensional
(second order stationary or intrinsic stationary of some given order (Matheron,
1973)) multivariate isotropic spatial random field (MSRF)
$X(s) = (X_1(s), \dots, X_k(s))^T$, for all $s \in D$, where D is a finite
partition of, say, n sites (or locations) of a given region. We assume that for
each $s \in D$ observations of the whole MSRF are available. We denote by C(0)
the k x k covariance matrix between the process components evaluated in each
site and by C(h) the cross-covariance matrix of the field evaluated between
sites $s_i, s_j \in D$ such that $d(s_i, s_j) = h$ is the distance between them.
Under the second order stationarity assumption (we'll deal with the intrinsic
case later on) we can define the set of cross-variogram matrices (see for
instance Cressie, 1991 and references therein) as:

$2\Gamma(h) = \mathrm{Var}(X(s_i) - X(s_j))$, (1)

for all pairs $s_i, s_j \in D$, $h = d(s_i, s_j)$ with $h = h_1, \dots, h_m$.
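On a line transect with equally spaced sites, a cross-variogram matrix at an integer lag can be estimated as in the sketch below. It uses the classical moment estimator, which is an assumption: the text does not say which estimator the authors adopt. The function name and the toy data are hypothetical.

```python
def cross_variogram_matrix(data, lag):
    """Empirical cross-variogram matrix at an integer lag on a line transect.

    data[j] is the list of the k values observed at site j. Half the average
    outer product of the increments at the given lag is returned.
    """
    n, k = len(data), len(data[0])
    pairs = n - lag
    gamma = [[0.0] * k for _ in range(k)]
    for j in range(pairs):
        d = [data[j][a] - data[j + lag][a] for a in range(k)]
        for a in range(k):
            for b in range(k):
                gamma[a][b] += d[a] * d[b] / (2.0 * pairs)
    return gamma


# Hypothetical transect of 5 sites with k = 2 variables per site.
transect = [[1.0, 2.0], [2.0, 1.0], [4.0, 3.0], [3.0, 5.0], [5.0, 4.0]]
gamma_1 = cross_variogram_matrix(transect, 1)
```

The result is symmetric by construction, which matches the property of cross-variogram matrices recalled in the Introduction.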

In $\Gamma(h)$, information on the spatial variation of the MSRF at distance h is
represented. As is well known, through PCA we are able to reduce a set of
correlated random variables into an uncorrelated set by an orthogonal
transformation. The aim of the transformation is to reconstruct a k-dimensional
random variable U by p<k linear functions without much loss of information;
then we seek the best linear predictor based on those p functions. The
efficiency of prediction may be measured by the residual variance, and an
overall measure of the predictive efficiency is the sum of such variances. If we
want to include the information available on the spatial behavior of the RF in
these p linear functions, we have to build them starting from a measure of the
spatial variation itself, i.e. the set of matrices given in (1). We can proceed
as in Tailiang Xie and Myers (1995a, 1995b) by finding a matrix that nearly
diagonalizes the m cross-variogram matrices. Then, by linearly transforming the
original data into almost spatially uncorrelated vectors, we can treat each
variable separately. However, as this procedure involves some complex and
computationally expensive calculations, we propose to proceed in a rougher, but
informative way.

Since our main interest is the study of the joint variation in space of the SRF
for all values of the distance between sites, we have to synthesize the
information contained in the m cross-variogram matrices. The most natural choice
in this direction is to build a synthesis matrix and apply PCA to it. We build
the following synthesis matrix:

$\Gamma = \sum_{h=h_1}^{h_m} \Gamma(h)$ (2)

Other choices of synthesis matrix are possible. For example
$\Gamma = \sum_h \Gamma(h)/m$ (which is equivalent to (2)) or, using some
previous knowledge about the considered phenomena, we can build a weight system
$\{w(h): \sum_h w(h) = 1, h = h_1, \dots, h_m\}$ and compute
$\Gamma = \sum_h w(h)\Gamma(h)$. Our method remains valid for all choices of
$\Gamma$.

Through the spectral decomposition of $\Gamma$ we find the orthogonal
transformation B that minimizes

$\mathrm{trace}(\Gamma - \Gamma B(B^T\Gamma B)^{-1}B^T\Gamma) = \sum_{h=h_1}^{h_m} \mathrm{trace}(\Gamma(h) - \Gamma(h)B(B^T\Gamma B)^{-1}B^T\Gamma(h)) - 2\sum_{h<l} \mathrm{trace}(\Gamma(h)B(B^T\Gamma B)^{-1}B^T\Gamma(l))$ (3)

As $\Gamma$ can be seen as a global measure of spatial variation, (3) can be seen
as an overall spatial residual prediction variance.

Further considerations have to be made. In the univariate setting it is of great
interest to evaluate the range (h*) of the variogram, as we can consider
observations taken at locations, say, $s_i$, $s_j$, such that
$d(s_i, s_j) > h^*$ almost uncorrelated. Using $\Gamma$ we can identify a
multivariate equivalent of the variogram's range. More precisely we write (2) in
terms of its eigenvalues and eigenvectors, i.e. $\Gamma = Q\Lambda Q^T$ where Q
is the k x k matrix of its normalized eigenvectors and $\Lambda$ is the diagonal
matrix of its eigenvalues. Then we can decompose each eigenvalue of $\Gamma$ in
terms of the contributions given to its value by each chosen lag h:

$\lambda_i = q_i^T \Gamma q_i = q_i^T \sum_{h=h_1}^{h_m} \Gamma(h)\, q_i = \sum_{h=h_1}^{h_m} q_i^T (C(0) - C(h))\, q_i$ (4)

then we can define:

$h_i^* = \arg\max_h q_i^T (C(0) - C(h))\, q_i$ (5)

The first eigenvalue accounts for the largest amount of the variability of the
SRF, so $h_1^*$ may be seen as a global range for the cross-variogram of the SRF,
i.e. the distance such that our observations are maximally uncorrelated. In fact
under second order stationarity assumptions we have $\Gamma(h) = C(0) - C(h)$, then

$h_1^* = \arg\max_h q_1^T (C(0) - C(h))\, q_1 = \arg\min_h q_1^T C(h)\, q_1$
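The decomposition of the first eigenvalue into lag contributions, and the resulting global range, can be sketched as follows. The cross-variogram matrices are hypothetical, and the leading eigenvector is obtained by power iteration, chosen here only to keep the sketch free of external libraries; any eigen-solver would do.

```python
def mat_vec(m, v):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def quad_form(m, v):
    """Quadratic form v^T m v."""
    return sum(v[i] * w for i, w in enumerate(mat_vec(m, v)))


def leading_eigenvector(m, iterations=500):
    """Power iteration for a small, non-zero, symmetric non-negative definite matrix."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def global_range(gammas):
    """gammas maps each lag h to its k x k cross-variogram matrix.

    Returns the lag with the largest contribution to the first eigenvalue,
    the contributions themselves, and the first eigenvalue.
    """
    k = len(next(iter(gammas.values())))
    total = [[sum(g[i][j] for g in gammas.values()) for j in range(k)]
             for i in range(k)]
    q1 = leading_eigenvector(total)
    contributions = {h: quad_form(g, q1) for h, g in gammas.items()}
    h_star = max(contributions, key=contributions.get)
    return h_star, contributions, quad_form(total, q1)


# Hypothetical cross-variogram matrices at lags 1, 2 and 3 for k = 2 variables.
gammas = {
    1: [[0.5, 0.1], [0.1, 0.4]],
    2: [[1.0, 0.3], [0.3, 0.8]],
    3: [[0.9, 0.2], [0.2, 0.7]],
}
h_star, contributions, lambda_1 = global_range(gammas)
```

The contributions add up to the first eigenvalue, and the lag with the largest contribution plays the role of the global range.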



In other words we define the range of the cross-variogram to be the range of
the univariate variogram of the first principal component obtained from the
singular value decomposition of $\Gamma$. In general we can define as many ranges
as components we compute. More precisely, let $\chi_i(s_j) = q_i^T X(s_j)$ be the
i-th principal component; its variogram is given by:

$\gamma_i(h) = \frac{1}{2}E(\chi_i(s) - \chi_i(s+h))^2 = \frac{1}{2}E(q_i^T X(s) - q_i^T X(s+h))^2 = q_i^T \Gamma(h)\, q_i$ (6)

Notice that the cross-variograms of the $\chi$'s are given by
$\gamma_{\chi_i\chi_j}(h) = q_i^T \Gamma(h)\, q_j$ and
$\sum_h \gamma_{\chi_i\chi_j}(h) = 0$ for $i \neq j$, the principal components
being orthogonal to each other.

In what follows we usually choose $h_1^*$ as global range, as the first
eigenvalue accounts for the largest part of the whole spatial variability.
However other choices are possible: we can compute p<k components and take the
largest value of $h_i^*$ (i=1,...,p) to be the global range. But this choice may
be misleading, as the corresponding h* may be due to, say, some secondary
spatial pattern in the field. Let us clarify this last remark with an example.
We consider a botanical data-set where, along a transect of 30 sites, the
coverage of 5 species of plants is observed. The data are organized in a 5x30
matrix and the environment suggests that a second order stationarity assumption
is reasonable. From a previous study, we already know that the association of
three of the five species induces a primary pattern in the vegetation which
influences blocks of 3 sites at a time (one site is a square of 1 m²), a
secondary pattern is induced by the antagonistic behavior of two species (the
presence of one species nearly excludes the presence of the other one) and a
third pattern is due to the scattered presence of a third rare species and
influences blocks of 8 sites. Applying the proposed method and considering the
first 3 principal components (the eigenvalues account respectively for 54%, 21%
and 17% of the total variability), we have $h_1^* = 3$ m, $h_2^* = 1$ m and
$h_3^* = 8$ m. If we chose h* = 8 m we would be led to think that the whole
spatial arrangement of the observed set of plants is determined by a rare
species.

Once we have the singular value decomposition of $\Gamma$ we can, as usual,
study the MSRF using only a few principal components, and this allows us to
explore the behavior of the SRF using the same tools developed in the
univariate case. For example we can use the principal components $\chi_i$ to
perform ordinary kriging.

2.1 Intrinsic stationarity and large values of $h^*$

The global range we just defined can be used in several ways; a possible one is
to use it to detect intrinsic stationarity. More precisely, analogously to the
univariate case, we expect that if our MSRF is intrinsic of order zero, the
first principal component given by the previous analysis has a variogram with
non-finite range. However some caution is necessary. Consider a MSRF observed
over n locations arranged according to a regular structure, for instance a line
transect. By applying our technique we find a value of $h_1^*$ close to $h_m$;
for instance in the botanical example $h_m = 29$. In this case two conclusions
can be drawn: either the SRF is second order stationary and the global spatial
autocorrelation decays for large values of the distance between sites, or the
SRF is intrinsic stationary and the global range is non-finite. A way to choose
between these two conclusions is to observe more locations along the transect.
In the next section we'll deal again with this problem through several
simulated examples.

3. Simulations and Application 

We conducted extensive simulation studies of this method (Capobianchi, Jona 
Lasinio 1997 unpublished manuscript, Capobianchi 1998) for both line 
transects and grid samples. Here the most representative simulations on line 
transects are described. 

We based our simulations on the following model (Capobianchi, 1998): let us
first consider $\{X(s), s \in D\}$, a vector valued second order stationary
(isotropic) SRF with, say, k components and |D| = n. It is easy to show that we
can write each component of the SRF as a linear combination of totally
uncorrelated random variables (r.v.) indexed by $s \in D$:

m=i r=i 

if and only if its cross-covariance can be written as

$C_{ip}(h) = \sum_{m=1}^{k} l_m \tau_{im} \tau_{pm} V^{(m)}(h)$ (8)

where the coefficients of the linear combinations (7) and (8) are obtained by
the following considerations. We assume that there are two main sources of
variation: one is due only to the interaction between the components of the
SRF, the other is due to the spatial arrangement of the RF. The two are
combined as follows.

• To each component of the multivariate SRF (MSRF) a spatial structure is
given as if no correlation between the process components were present. That
structure is represented by an n x n matrix $V^{(i)}$ (i=1,...,k) of spatial
auto-covariances. We take its spectral decomposition and denote its eigenvalues
and the elements of its eigenvectors with indices r,j=1,...,n; i=1,...,k.

• To each site $s_j \in D$ we assign the same correlation structure between the
process components (when no spatial correlation is given) and we denote the
corresponding covariance matrix by C*(0), whose spectral decomposition has
eigenvalues $l_m$ and eigenvector elements $\tau_{im}$ (m,i=1,...,k).

The MSRF (7) is simulated through the following steps: 

a) We fix C*(0) and we generate Z from a k x n Gaussian distribution with
zero mean vector and identity covariance matrix;

b) We define Y(i)^=Z(i7B(i) where B(i)=S(i)*^^^(i). Then for each fixed site sj we 
have a A:-dimensional r.v. with uncorrelated components, and each 
component Y(i)( 5 )) for j=l,...,n has spatial structure given by 

c) Finally we compute X(s) =L*^^7Y(s). 
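The three steps can be illustrated with a simplified sketch. It swaps in Cholesky factors for the spectral square roots described in the text, which gives the same covariance structure but is not the authors' exact construction; the covariance model, sizes and function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor l with l l^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][r] * l[j][r] for r in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l


def simulate_msrf(spatial_covs, local_cov, rng):
    """Simulate a k x n field.

    spatial_covs: one n x n spatial covariance matrix per component (step b).
    local_cov: k x k covariance between components at a site (step c).
    """
    k, n = len(spatial_covs), len(spatial_covs[0])
    y = []
    for v in spatial_covs:
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # step a: white noise
        f = cholesky(v)
        # step b: give each uncorrelated component its spatial structure
        y.append([sum(f[j][r] * z[r] for r in range(j + 1)) for j in range(n)])
    a = cholesky(local_cov)
    # step c: mix the components to induce the local correlation
    return [[sum(a[i][m] * y[m][j] for m in range(k)) for j in range(n)]
            for i in range(k)]


n_sites = 6
exp_cov = [[math.exp(-abs(i - j) / 2.0) for j in range(n_sites)]
           for i in range(n_sites)]
local = [[1.0, 0.5], [0.5, 1.0]]
field = simulate_msrf([exp_cov, exp_cov], local, random.Random(0))
```

Here `field[i][j]` is the value of component i at site j; with unit-variance spatial structures the covariance between components at a site equals `local`.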

In order to simulate an intrinsic MSRF we proceed as above, the only difference
being in the way we build the $V^{(i)}$ matrices. In this case we use the
generalized covariance matrix defined in Matheron (1973). Let $K^{(i)}$ be the
generalized covariance matrix associated to the i-th variable, then
$V^{(i)} = PK^{(i)}P$, with $P = I - 1_n 1_n^T/n$, where I is the n x n identity
matrix and $1_n$ is an n x 1 vector of unitary elements. The generalized
covariance matrix has elements $K_{jw}^{(i)} = -\gamma^{(i)}(h)$
($h = d(s_j, s_w)$; j,w=1,...,n). Throughout the simulations we used several
variogram models over transects of n=20, 30, 50 (65 for the intrinsic SRF)
sites. The MSRF has 10 components (k=10). In our study we distinguished between
«low» (almost diagonal local covariance matrix), «medium» and «high» correlation
among the process components. In this paper simulations with «medium» local
variability structure are shown. We further distinguished between variogram
models with small and large ranges, meaning that in one case the sill is reached
for a small value of the distance between sites and in the other it is reached
for a large one (small and large have to be seen w.r.t. the transect's length).
We take into account values of the distance h between sites only up to a largest
value $h^\circ$, as not enough observations are available to estimate the
variogram matrices for higher values of h. The largest distance values we
consider are $h^\circ = 42$ when n=50, $h^\circ = 25$ when n=30 and
$h^\circ = 15$ when n=20. In all examples we simulate 500 samples of line
transects of n=50 squares. For each simulated sample we compute the matrix
$\Gamma$ and we decompose its eigenvalues according to the proposed method.
Then, in order to verify the influence of the sample size on the procedure, we
repeat the analysis on the same simulated data, truncating the transects at
n=30 and n=20 squares.

The simulations collected in Table 1 have been performed as follows: 

1) Sim.1 In this scheme we fix a Cauchy variogram model (see for instance 
Cressie 1991, Wackernagel 1995) with parameters a=1, b=5 and o=1 for the 
components of the SRF with odd index (i=1,3,5,7,9). To the remaining 
variables we associate a spherical variogram model with parameters a=1 
and b=5. The two models reach the sill for small values of the distance 
between sites (h*<16 for each component on line transects of 50 locations). 
The variability between the MSRF components is fixed. 

2) Sim.2 We simulate large range variograms (h*>24 for each component on 
line transects of 50 locations). To 5 components we associate an exponential 
variogram model with parameters a=1 and b=15 and to the other 5 a 
spherical model with parameters a=1 and b=35. 

3) Sim.3 We simulate from a MSRF whose components have both small and 
large range. We “mix” exponential variograms with parameters a=1 and 
b=3 (small range) with spherical variograms with parameters a=1 and b=30 
(large range). 
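The difference between small and large range models can be illustrated with a minimal sketch. It assumes that a is the sill and b the range parameter, and it uses one common parametrisation of the exponential model, $a(1 - e^{-h/b})$; the text does not state which parametrisation was used, so both choices, and the helper name `practical_range`, are assumptions made only for illustration.

```python
import math

def spherical(h, sill, rng):
    """Spherical variogram: reaches the sill exactly at distance rng."""
    if h >= rng:
        return sill
    r = h / rng
    return sill * (1.5 * r - 0.5 * r ** 3)

def exponential(h, sill, rng):
    """Exponential variogram (assumed form): approaches the sill asymptotically."""
    return sill * (1.0 - math.exp(-h / rng))

def practical_range(model, sill, rng, h_max, level=0.95):
    """Smallest integer lag h <= h_max at which the model reaches level * sill."""
    for h in range(h_max + 1):
        if model(h, sill, rng) >= level * sill:
            return h
    return None  # the sill is not (practically) reached within h_max

# Sim.3 mixes a small range exponential model (b=3) with a large range
# spherical model (b=30) on transects where lags up to 42 are considered.
small = practical_range(exponential, 1.0, 3.0, 42)
large = practical_range(spherical, 1.0, 30.0, 42)
```

Under these assumptions the exponential component flattens out within the first few lag classes, while the spherical one does so only in the upper half of the admissible lags.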

The global range is found (with large confidence) where we expected it. The 
first two simulation schemes are easy to read, being quite homogeneous 
from both the spatial and the “local” point of view. Sim.3 is more 
interesting: here the interaction between the process components plays a more 
relevant role. The different spatial structures influence each other through 
the existing correlation between the MSRF components. We find the global range 
54% of the times in the first two classes (n=50): since some of the variables 
with a small range structure and larger variance are relatively highly 
correlated with all the others, the global range is “moved” towards small 
values. The same situation appears when dealing with n=30 and n=20. Analogous 
results are obtained when we assign to the same variable a large range model 
and to the others a small range one. 



Table 1. Second order MSRF: decomposition of the first eigenvalue of F and 
determination of the global range (%). 

Global         Sim.1                  Sim.2                  Sim.3
range     n=50   n=30   n=20     n=50   n=30   n=20     n=50   n=30   n=20
1-5       40.6   45.4   61.0      0.0    0.0    0.2     20.6   28.2   21.8
6-10      32.0   36.6   31.8      0.2    3.4      …     20.2   31.8   52.2
11-15     16.0   13.4    7.2      2.8   26.2   77.4     14.0   15.0   26.0
16-20      6.8    3.8      -     19.6   63.8      -      8.8   11.0      -
21-25        …    0.8      -     24.0      …      -     13.0   14.0      -
26-30        …      -      -     38.2      -      -        …      -      -
31-35      0.0      -      -     11.6      -      -      7.4      -      -
…          0.0      -      -      1.4      -      -        …      -      -
Tot.     100.0  100.0  100.0    100.0  100.0  100.0    100.0  100.0  100.0

The simulation of an intrinsic stationary MSRF follows the steps described at 
the beginning of this section. We build the generalized covariance matrix by 
choosing an intrinsic (isotropic) variogram model for the components of the 
process. For all of them we fixed a linear variogram model with parameters a=5 
and b=6. Applying our technique, we find that the global range is always 
very close to the maximum admissible value h°=44. In Table 2 we compare 
results from simulations of intrinsic MSRF and large range second order 
stationary MSRF at various sample sizes. The same procedure described above 
has been performed. 

Table 2. Comparison between intrinsic and large range simulations (%) 

Global   Intrinsic  Large range  Intrinsic  Large range  Intrinsic  Large range
range    MSRF       MSRF         MSRF       MSRF         MSRF       MSRF
         n=65       n=65         n=50       n=50         n=30       n=30
5-9        0.0        0.0          0.0        0.2          0.0        1.2
10-14      0.0        1.0          0.2        1.4         11.3       20.6
15-19      0.2        4.2          2.6       19.6         72.3       71.4
20-24      4.8       26.4          7.8       25.0         16.4        6.8
25-29      9.0       23.0         41.2       42.2            -          -
30-34     13.2       18.6         39.0        8.6            -          -
35-39     36.6       14.8          9.0        3.0            -          -
40-44     36.2       12.0          0.2        0.0            -          -
Tot.     100.0      100.0        100.0      100.0        100.0      100.0



From the simulations with 65 locations it is clear that over a large enough 
sample we are able to distinguish the two spatial variability models with large 
confidence. Table 2 also shows the analysis of smaller samples. When n=50, in 
the large range case we find a global range smaller than 30 in 88% of the 
simulations, while in the intrinsic case 78% of the simulations find the global 
range closer to the largest values of the distance. For smaller sample sizes we 
cannot discriminate between the two models. 
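As a worked check of the n=50 comparison, the class percentages of Table 2 can be accumulated directly. The sketch below only re-adds the published figures; the threshold of 30 is the one at which the large range column reproduces the 88% quoted above, and the function name is illustrative.

```python
# Percentages from Table 2, n=50 columns (lower bound of each class -> %).
large_range_n50 = {5: 0.2, 10: 1.4, 15: 19.6, 20: 25.0,
                   25: 42.2, 30: 8.6, 35: 3.0, 40: 0.0}
intrinsic_n50 = {5: 0.0, 10: 0.2, 15: 2.6, 20: 7.8,
                 25: 41.2, 30: 39.0, 35: 9.0, 40: 0.2}

def share_below(table, threshold):
    """Percentage of simulations whose global range class starts below threshold."""
    return round(sum(p for lower, p in table.items() if lower < threshold), 1)

large_below_30 = share_below(large_range_n50, 30)
intrinsic_below_30 = share_below(intrinsic_n50, 30)
```

With these figures the large range model places the global range below 30 far more often than the intrinsic model does, which is the contrast the paragraph relies on.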



3.1 Application to botanical data 

In order to verify the behavior of our technique on real data, we analyze a set 
of observations previously studied through the technique proposed in Di Bella 
and Jona Lasinio (1996). The data are measures of the abundance of several 
plant species. The measurements are the centimeters covered by each of 8 
species and are taken along a line transect 30 meters long (each site is a 
square of 1 m²). Because of the nature of the vegetation we can assume second 
order stationarity. We applied the method proposed in Di Bella and Jona Lasinio 
in order to find the “true” scale of the data. At that scale observations of 
the MSRF can be considered “almost” spatially uncorrelated. The “true” scale is 
found so that 9 units are aggregated by a type of spatial moving average. The 
new data set has been analyzed with our technique and the global range is found 
at distance h*=1. Then we analyzed the disaggregated data set. The first two 
eigenvalues account for 77.8% of the total variability. In Table 3 species 
loadings are reported and in Figure 1 the first two principal components are 
shown. 



Table 3: Species loadings 

Species                   First eigenvector   Second eigenvector
Asparagus acutifolius        0.001313899         0.0402381603
Cistus monspeliensis        -0.077701392        -0.0514799760
Erica multiflora             0.265024609         0.5698000029
Myrtus communis              0.956737575        -0.0879254935
Phillyrea angustifolia      -0.021453125        -0.0423429708
Pistacia lentiscus          -0.088209856         0.8129921308
Quercus ilex                 0.009914978         0.0008337439
Rubia peregrina              0.006034461        -0.0240403769



The spatial behavior of the 8 species is well represented. The first component 
is mainly characterized by the joint presence of Myrtus communis and Erica 
multiflora, describing a primary pattern due to these two dominant species. The 
second component describes the spatial pattern originated by Pistacia 
lentiscus, a less common plant. 
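The reading of the loadings can be checked numerically. The following sketch copies the values of Table 3 (the stray mark in the Erica multiflora entry is assumed to be a printing artefact, so the value is taken as 0.265024609), verifies that each eigenvector has unit length, and ranks the species by absolute loading. The function names are illustrative only.

```python
import math

# Loadings copied from Table 3: (first eigenvector, second eigenvector).
LOADINGS = {
    "Asparagus acutifolius": (0.001313899, 0.0402381603),
    "Cistus monspeliensis": (-0.077701392, -0.0514799760),
    "Erica multiflora": (0.265024609, 0.5698000029),  # assumed reading of the entry
    "Myrtus communis": (0.956737575, -0.0879254935),
    "Phillyrea angustifolia": (-0.021453125, -0.0423429708),
    "Pistacia lentiscus": (-0.088209856, 0.8129921308),
    "Quercus ilex": (0.009914978, 0.0008337439),
    "Rubia peregrina": (0.006034461, -0.0240403769),
}

def norm(component):
    """Euclidean length of the chosen eigenvector (0 = first, 1 = second)."""
    return math.sqrt(sum(v[component] ** 2 for v in LOADINGS.values()))

def dominant(component, top=2):
    """Species with the largest absolute loadings on the chosen component."""
    ranked = sorted(LOADINGS, key=lambda s: abs(LOADINGS[s][component]),
                    reverse=True)
    return ranked[:top]
```

Both vectors have length very close to one, as expected for eigenvectors, and the ranking singles out Myrtus communis on the first axis and Pistacia lentiscus on the second, with Erica multiflora contributing to both.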

Figure 1: Data projected on the first and second principal axes (two panels: 
first principal component vs sites; second principal component vs sites). 

More interesting are applications of the proposed method to data on grids 
(regular and irregular). In this context the authors are working on several 
theoretical and applied developments. 

Abstract: Surveys of spatially distributed phenomena are often conducted using 
geographical areas as strata. If CATI (computer assisted telephone 
interviewing) methodology is used to contact the units, it is possible to choose 
between two different methods of selecting the units to be interviewed: either 
from a full list of the population units or through the RDD (random digit 
dialling) technique. In the latter case it seems natural to use telephone 
exchange areas as strata. This could be a very interesting solution from many 
points of view. The aim of this paper is to find a methodology to assess the 
appropriateness of using such a pre-defined geographical stratification in 
comparison with the usual clustering methods, based on a set of auxiliary 
variables correlated with the phenomena under study, to define the strata. The 
choice will depend on some measure of similarity and/or on the evaluation of 
the homogeneity in the strata for the specific phenomenon to be analysed (in 
our application the loss of homogeneity is evaluated with respect to a 
hypothetical set of variables under study). 

Keywords: Contiguity Classification, Telephone Surveys, Cluster Validation. 

1. Introduction 

In statistical surveys, methodologies based on the telephone as a cheap and 
fast contact tool are growing and becoming more and more widespread. In this 
case it is very interesting to consider the use of computer assisted telephone 
interviewing (CATI) to survey spatially distributed phenomena for which it is 
necessary to use geographical areas as strata. Much auxiliary information that 
can be very useful in analysing a particular phenomenon is often available at 
geographical level: for example, a considerable quantity of georeferenced 
census data is available at enumeration district level. 

A first CATI selection method is based on the use of a telephone directory as a 
frame. In this case the units drawn from the directory can be georeferenced 
through address matching. This is not an easy task, as it requires processing 
character data such as the complete address of every telephone subscriber. 
Moreover, some telephone subscribers are not included in the telephone 
directory and for this reason they will be excluded from the survey. 

An alternative selection method is based on the use of the random digit 
dialling (RDD) technique to choose the units to be called. With this method the 
telephone directories are not used for the selection and every telephone number 
has the same probability of being dialled. To avoid loss of time, the random 
numbers must be generated taking into account the codes of the telephone lines 
installed in the telephone exchange of each area involved in the survey. With 
the RDD methodology a simpler way of georeferencing the information is to use 
the telephone area as geographical unit. 

In Italy, the telephone system is based on a hierarchical system of enumeration 
using the telephone exchange area as the smallest geographical area. The 
telephone numbers are composed of an area code (telephone district code) and 
the telephone subscriber number (the first 2, 3, or 4 digits of this number 
identify the telephone exchange area). In practice, every telephone exchange 
area is associated with one or more enumeration sequences. Using the 
enumeration sequences and the coverage of the telephone exchange area it is 
possible to georeference the randomly generated telephone number. 
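The link between enumeration sequences and georeferencing can be sketched as follows. The table of sequences, the area names and the six-digit subscriber length are invented for illustration and do not come from the Florence data; the sketch also assumes that no sequence is a prefix of another.

```python
import random

# Hypothetical enumeration sequences: exchange area -> leading digits of the
# subscriber numbers served by that exchange (illustrative values only).
ENUMERATION_SEQUENCES = {
    "area_01": ["21", "22"],
    "area_02": ["471"],
}
SUBSCRIBER_LENGTH = 6  # assumed number of digits in a subscriber number

def draw_number(area, rng):
    """Generate a random subscriber number inside one exchange area."""
    prefix = rng.choice(ENUMERATION_SEQUENCES[area])
    tail_length = SUBSCRIBER_LENGTH - len(prefix)
    tail = "".join(rng.choice("0123456789") for _ in range(tail_length))
    return prefix + tail

def georeference(number):
    """Return the exchange area whose enumeration sequence matches the number."""
    for area, prefixes in ENUMERATION_SEQUENCES.items():
        if any(number.startswith(p) for p in prefixes):
            return area
    return None  # the number does not belong to any surveyed exchange

rng = random.Random(0)
number = draw_number("area_02", rng)
```

Restricting the random digits to the tail of the number is what avoids dialling sequences that are not installed in the exchange, and the same prefix table then serves to attribute each generated number to its area.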

Through the GIS (geographical information system) functions it is possible to 
obtain other information on the areas under study (enumeration districts and 
telephone exchange areas), such as the centres of gravity, the co-ordinates, 
and the size of the areas. Furthermore, with a GIS it is possible to conduct 
any kind of spatial analysis. The geographical organisation of the data would 
also enable the pattern analysis of the considered phenomenon and the final 
representation of the results. 

In this paper we make a first attempt to define a methodology to verify the 
appropriateness of using the telephone exchange areas instead of a partition 
into clusters of enumeration districts built through a classification 
algorithm under contiguity constraints. In the next section we define the 
classification problem and the constraints to be imposed. In the following 
section we make some suggestions on cluster validation based either on some 
measure of similarity or on the evaluation of the homogeneity of some auxiliary 
variables that are probably highly correlated with the surveyed variables. 
Before the conclusions, we report an empirical application using real data on 
the Florence telephone sectors. 

2. Constrained classification 

The purpose of cluster analysis in this kind of application is the 
identification of areas that are homogeneous with respect to the phenomena 
under study. A geographical area is considered homogeneous if its inhabitants 
behave similarly with respect to the phenomena under study. The geographical 
areas are those obtained from the aggregation of smaller areas (such as the 
enumeration districts). 

Our problem can be defined as follows: let X be the set of the N spatial 
objects represented by the enumeration districts, 
$X = \{X_1, X_2, \ldots, X_N\}$, and consider a specific partitioning Y of X in 
L sets, where the partitioning refers to the enumeration districts: 

$Y = \{Y_1, Y_2, \ldots, Y_L\}$, 

where each group is a contiguous set of the N objects of X, represented by the 
telephone exchange areas. 

Our assignment is to produce a new partition of X, i.e. Z, so that: 

$Z = \{Z_1, Z_2, \ldots, Z_K\}$, 

where each group is also a contiguous set of the N objects of X. 



For N sufficiently large this can be accomplished by using a non-hierarchical 
method. For our application we used the FASTCLUS procedure of SAS, 
imposing K=L (SAS, 1990). In our classification problem, however, it is 
relevant to impose constraints on the set of allowable solutions. 

When a classical clustering technique, such as K-means, is applied to 
geographically located data, without using the spatial information, the resulting 
partition has often a sparse appearance over the geographical space (i.e. clusters 
look dispersed, and reflect only poorly any eventual underlying spatial 
structure). 

Several alternatives have been proposed in order to take into account the 
geographic location of the data, and produce clusters that are spatially 
homogeneous. Gordon (1996) presents a complete review on this topic. In this 
paper we consider only the approaches more suitable for the solution of our 
problem. 

The most common type of constraint is the one specified by contiguity: the 
objects in a class are required not only to be similar, but also to be spatially 
contiguous. 

Proximity information in two-dimensional space can be incorporated more 
directly into a classification by making use of the geographical distances 
separating each pair of objects, or a contiguity graph or matrix (Legendre, 1987, 
Openshaw, 1977). This implies the definition of a neighbourhood concept. 

In this approach, the elements of a contiguity matrix are defined by: 

$$ C(i,j) = \begin{cases} 1 & \text{if the $i$th and $j$th objects are contiguous,} \\ 0 & \text{otherwise.} \end{cases} $$

The specification of the contiguity graph of a set of objects is not always 
straightforward. If the objects are areas of land covering a region, areas 
sharing a common boundary should clearly be regarded as contiguous, but when 
the number of these area objects increases some computational problems can 
arise, as in our case, where the dimension of the contiguity matrix was very 
large. 
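One way to limit the size problem is to store only the contiguous pairs rather than the full matrix. The sketch below is an illustration of that idea, not the procedure used in the application; it assumes that the list of area pairs sharing a boundary is available (for instance from a GIS), and the function names are hypothetical.

```python
from collections import defaultdict

def contiguity_sets(shared_boundaries):
    """Build a sparse contiguity structure from pairs of areas sharing a boundary."""
    neighbours = defaultdict(set)
    for i, j in shared_boundaries:
        neighbours[i].add(j)
        neighbours[j].add(i)
    return neighbours

def c(neighbours, i, j):
    """C(i, j) = 1 if the ith and jth objects are contiguous, 0 otherwise."""
    return 1 if j in neighbours.get(i, ()) else 0

# Three areas in a row: 1 touches 2, and 2 touches 3.
neighbours = contiguity_sets([(1, 2), (2, 3)])
stored_entries = sum(len(s) for s in neighbours.values())
```

For the 3813 enumeration districts of the application a dense matrix would have about 14.5 million cells, almost all of them zero, whereas the sparse form stores two entries per shared boundary.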

Other approaches to contiguity-constrained classification consist in 
computing a distance between objects that is a function both of the 
geographical distance and of the distance measured through the values of the 
variables. In general, the modified distance between the spatial objects i and 
j can be written as: 



Wij = the/th and ^h object are contiguous, 

\d^J = dy otherwise, 

where f is an increasing function of both arguments, $d_{ij}$ is a distance 
measure between the vectors of the variables on the ith and jth objects, and 
$\psi_{ij}$ is an increasing function of the geographical distance between the 
ith and jth objects (Zani, 1993). This approach requires weighting the 
geographical distance, $\psi_{ij}$, against the distance measured on the 
variables, $d_{ij}$, and it is not easy to specify the relative weights in an 
objective manner. 

We followed a natural approach that uses the geographical co-ordinates of the 
centres of gravity of the enumeration districts, more or less heavily weighted, 
as an additional pair of variates (Barry, 1966, Jain and Farrokhnia, 1991). 
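The effect of the weighting can be seen in a minimal sketch, under the assumption that the clustering uses the Euclidean distance on the augmented vectors. The weight, the feature values and the function names are illustrative.

```python
import math

def augment(features, centroid, weight):
    """Append the weighted centroid co-ordinates (x, y) to the feature vector."""
    x, y = centroid
    return list(features) + [weight * x, weight * y]

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two districts with identical census variables, 5 units apart on the map.
d_unweighted = euclidean(augment([1.0, 2.0], (0.0, 0.0), 0.0),
                         augment([1.0, 2.0], (3.0, 4.0), 0.0))
d_weighted = euclidean(augment([1.0, 2.0], (0.0, 0.0), 2.0),
                       augment([1.0, 2.0], (3.0, 4.0), 2.0))
```

With weight zero the two districts are indistinguishable, while a larger weight separates them in proportion to their geographical distance, which pushes a non-hierarchical method towards spatially compact clusters without guaranteeing strict contiguity.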



3. Partition assessment 

Few constrained classification studies have addressed the problem of cluster 
validation, but in our case we consider this problem relevant. In our situation 
the problem is slightly different from a normal cluster validation. As already 
said, our task is to compare the two partitions of X, where one, Y, was defined 
to be the telephone exchange areas, and the other, Z, was obtained with a 
clustering algorithm taking into account spatial relations. In other words, Z 
is a set of clusters of enumeration districts suitable for data analysis, 
while, at least for practical reasons, Y would be a clustering suitable for 
data collection. 

The problems to be addressed are now: 

1) evaluating the correspondence between the two partitions, i.e. to measure 
the resemblance between partition Y and partition Z; 

2) evaluating the adequacy of partition Y with respect to partition Z for data 
analysis. 

Regarding the first aspect, we have to measure the similarity between the two 
partitions using an index that compares them. A basic unit of comparison 
between two partitions is how pairs of objects are clustered. Starting from 
this concept, Rand (1971) proposed an index c, varying from 0 to 1, based on 
counting how the individual element pairs are placed, together in a cluster or 
apart, in each of the two partitions, Y and Z. Rand proposed a simple 
computational form for this index: 

$$ c(Y,Z) = \frac{\binom{N}{2} - \left[\frac{1}{2}\left\{\sum_i \left(\sum_j n_{ij}\right)^2 + \sum_j \left(\sum_i n_{ij}\right)^2\right\} - \sum_i \sum_j n_{ij}^2\right]}{\binom{N}{2}} $$

where N is the number of objects, and $n_{ij}$ is the number of elements 
simultaneously in the ith cluster of Y and in the jth cluster of Z. 

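The relative size of the Rand index and of other similar indices is difficult 
to evaluate and compare, as they are not corrected for chance. Hubert and 
Arabie (1985) therefore proposed a correction for chance, and the corrected 
Rand index assumes this form: 

$$ c'(Y,Z) = \frac{\sum_i \sum_j \binom{n_{ij}}{2} - \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \Big/ \binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2}\right] - \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \Big/ \binom{N}{2}} $$

where $n_{ij}$ is the number of elements common to the ith cluster of Y and the 
jth cluster of Z, $n_{i\cdot} = \sum_j n_{ij}$ and 
$n_{\cdot j} = \sum_i n_{ij}$. 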
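Both indices can be computed from the cross-classification of the two partitions. The sketch below is a plain implementation of the standard Rand (1971) and Hubert and Arabie (1985) formulas, written with pair counts; the function name and the toy labels are illustrative, and the degenerate case in which the denominator of the corrected index vanishes is not handled.

```python
from collections import Counter
from math import comb

def rand_indices(labels_y, labels_z):
    """Return (Rand index, chance-corrected Rand index) for two partitions."""
    n = len(labels_y)
    n_ij = Counter(zip(labels_y, labels_z))   # cross-classification counts
    n_i = Counter(labels_y)                   # cluster sizes in Y
    n_j = Counter(labels_z)                   # cluster sizes in Z
    total_pairs = comb(n, 2)
    together_both = sum(comb(v, 2) for v in n_ij.values())
    together_y = sum(comb(v, 2) for v in n_i.values())
    together_z = sum(comb(v, 2) for v in n_j.values())
    # Pairs placed together by one partition and apart by the other.
    disagreements = together_y + together_z - 2 * together_both
    rand = (total_pairs - disagreements) / total_pairs
    expected = together_y * together_z / total_pairs
    maximum = 0.5 * (together_y + together_z)
    adjusted = (together_both - expected) / (maximum - expected)
    return rand, adjusted

identical = rand_indices([0, 0, 1, 1], [0, 0, 1, 1])
crossed = rand_indices([0, 0, 1, 1], [0, 1, 0, 1])
```

Two identical partitions give 1 on both indices, while the crossed toy example gives a Rand index of 1/3 and a negative corrected value, showing how the correction removes the agreement expected by chance.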



The other method to be considered is based on the evaluation of the adequacy of 
partition Y with respect to partition Z for data analysis. One methodology can 
be based on the use of a raw measure of the homogeneity within the groups. For 
this purpose we adopted the average R-squared weighted by variance. This index 
must be calculated for each of the two partitions to evaluate the loss in 
homogeneity implied by the use of the telephone exchange areas. 



4. The application 

The application was carried out using the data of the telephone sector of 
Florence and the census data of the enumeration districts of the same area. 

The telephone sector of Florence covers the municipal area of the city of 
Florence and eight other municipalities of the Florentine Province (Bagno a 
Ripoli, Calenzano, Impruneta, Fiesole, Rignano, Sesto Fiorentino, Scandicci and 
Vaglia), which are adjacent to the city of Florence. The map of the 
geographical coverage of the telephone exchange areas was digitised and stored 
in the GIS ARC/INFO (see Fig. 1). A unique numerical code, which is the key for 
linking the graphic areas with the enumeration sequence table to be used 
with the RDD methodology, was assigned to every telephone exchange area. 
The sequence table allows assessing the capacity of every telephone 
exchange device. 




On the other hand, the enumeration districts are often geographical areas too 
small to be used as strata in a survey; for this reason we grouped the 
enumeration districts using a clustering algorithm with and without the 
contiguity constraints. The map of the enumeration districts of the studied 
area was already in digital format and stored in the GIS. 

The data considered for the cluster analysis were only a few of the census 
variables at the enumeration district level, which the Italian central bureau 
of statistics (ISTAT) disseminates for the 1991 population census. 

To preserve the confidentiality of the data, the variables released by ISTAT 
are mainly socio-demographic variables. Unfortunately, we could not use any 
economic variables (e.g. income, professional position, etc.) that would be 
more relevant for our study. The socio-demographic variables nevertheless give 
a first view of some social aspects that can be considered as auxiliary 
information for the stratification. For this purpose we considered 10 
variables, including the population density (using the area information 
supplied by the GIS), the percentage of males, an index of the age structure of 
the population, the percentage of self-employed workers, the number of houses 
per person, and the ratio of occupied houses to total houses. 

Some spatial analyses, such as the polygon overlay, were carried out on the 
stored digital maps using the GIS to assign to every enumeration district the 
corresponding telephone exchange area code. 




The telephone exchange areas are a partition of the geographical space, but 
they are not always an exact aggregation of the enumeration districts. The case 
of an enumeration district belonging to several telephone exchange areas can be 
solved by attributing the district to the telephone area with the highest 
percentage of overlap. In this way, we obtain the complete list of the 
enumeration districts belonging to each telephone exchange area. Of the 5624 
enumeration districts enclosed in the above listed municipalities only 3813 
were used, the ones overlapping the 49 telephone exchange areas into which the 
sector of Florence is divided. It is then possible to group at telephone 
exchange area level all the information available at enumeration district 
level (Mohadjer, 1988). 
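The attribution rule can be written compactly. The sketch assumes that the polygon overlay has already produced, for each district, the share of its surface falling in each exchange area; the input layout, the identifiers and the function names are hypothetical.

```python
def assign_districts(overlaps):
    """Map each district to the exchange area with the largest overlap share.

    overlaps: {district: {exchange_area: share of the district surface}}.
    Districts that overlap no exchange area are left out.
    """
    return {district: max(shares, key=shares.get)
            for district, shares in overlaps.items() if shares}

def members(assignment):
    """List the districts attributed to each exchange area."""
    grouped = {}
    for district, area in assignment.items():
        grouped.setdefault(area, []).append(district)
    return grouped

overlaps = {
    "d1": {"A": 0.7, "B": 0.3},  # split district: goes to A
    "d2": {"B": 1.0},
    "d3": {},                    # outside the telephone sector: dropped
}
assignment = assign_districts(overlaps)
```

Dropping the districts with no overlap mirrors the reduction from 5624 to 3813 districts, and the grouped result is the list of districts per exchange area on which the aggregated statistics are computed.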

So the problem was to consider the 3813 enumeration districts overlapping the 
49 telephone exchange areas as elements to be grouped before carrying out the 
analysis. On the other hand, the 49 telephone exchange areas can already be 
considered a partition of the enumeration districts, which cannot be further 
divided. The result of the cluster analysis performed is shown in Fig. 2. 

At this point, using the Rand index on the two partitions we found a value of 
c=0.95, while using the corrected Rand index the value of c' is 0.50. This 
result means that the overlap between the areas is very high, but about half of 
the value is due to the contiguity condition and to the constraint imposed on 
the number of clusters (K=L). 

We also performed a raw evaluation of the adequacy of the partition for data 
analysis with the R-squared index. Using the telephone exchange areas as 
strata, the value of R-squared was 0.09. With a non-hierarchical partition, 
allowing a number of clusters equal to the number of the telephone exchange 
areas, the R-squared was 0.59. To evaluate more precisely the loss of 
homogeneity using the telephone exchange areas as strata, the same index was 
computed on groups of enumeration districts satisfying the contiguity of the 
areas that compose each cluster. The R-squared value obtained (0.29) is nearer 
to that computed on the telephone exchange areas, even if the difference is 
still remarkable. 

5. Conclusions 

We have proposed the utilisation of the telephone exchange areas to stratify a 
geographical area. The objection to the ad-hoc clustering can be overcome by 
some measure of similarity or by some homogeneity indices. We applied some 
of these indices to assess the stratification in the Florence area and found that 
the telephone exchange areas were very similar to the constrained clustering of 
the enumeration districts. In such a situation the utilisation of the telephone 
exchange area as a frame to conduct an RDD survey would be appropriate. 
Obviously the operative decisions should be taken considering the real 
phenomenon under study for which the data will be collected. 

The methodology proposed in this paper is a first attempt to face the problem 
of comparing a given area partition with a computed clustering. Further studies 
must be carried out to define a more suitable homogeneity measure. 

Abstract: This paper addresses the application of filtering and smoothing 
algorithms, in particular Kalman filtering, to spatially dependent data. We 
take into account the case of first- and second-order homogeneous Gauss-Markov 
Random Fields (GMRF), and we address the question of parameter estimation for 
this class of spatial processes; then we consider the possibility of expressing 
these processes as unilateral ones, so that they can be written in state-space 
form; and, finally, we present a “classical” Kalman filter algorithm, which is 
particularly suitable for the case of satellite images contaminated by additive 
Gaussian noise. 

Keywords: Gauss-Markov Random Fields, unilateral representation, Kalman 
filter. 



1. Introduction 

In this paper, we shall refer to the case of spatial data, collected on a 
regular lattice divided into N×M zones; when the random variables (r.v.) taken 
into account are followed by a single index, we suppose the zones to be 
numbered from 1 to n (n = N×M), moving from left to right and from the top to 
the bottom of the lattice. 

In general terms, denoting by $S_{ij}^{(p)}$ the p-th order neighbour set of 
the zone (i,j) (chosen on the basis of a certain distance criterion, normally 
the Euclidean one, and of a given order of spatial lag), we define as a Markov 
random field the spatial process for which the following is valid: 

$$ \Pr(x_{ij} \mid x_{kl} : (k,l) \in D^{ij}) = \Pr(x_{ij} \mid x_{kl} : (k,l) \in S_{ij}^{(p)}) \qquad (1) $$

where $D^{ij}$ denotes the parametric space from which the zone (i,j) has been 
deleted. If we hypothesize that the conditional distribution of the generic 
r.v. $X_i$ is normal, with parameters: 

$$ \theta_i = E(X_i \mid x_j : j \neq i) = \mu_i + \sum_{j=1}^{n} c_{ij}(x_j - \mu_j); \qquad \tau_i^2 = \mathrm{Var}(X_i \mid x_j : j \neq i), $$

we obtain, by applying Brook's factorization theorem (Besag, 1974): 

$$ X \sim MVN(\mu, (I-C)^{-1}M) \qquad (2) $$

where M is the diagonal matrix containing the conditional variances, and I-C is 
the NM×NM matrix containing the field potentials (that is, the spatial 
interaction parameters), and is therefore called the potential matrix. 



2. Gauss-Markov random fields: structure and parameters 
estimation 

Under the hypothesis of equal conditional variances, that is, 
$\tau_i^2 = \tau^2$, $\forall i = 1,2,\ldots,NM$, the exponent of the density 
function (2) may be written in compact form: 

$$ U(x) = -(2\tau^2)^{-1} x'Ax $$

where A=I-C. The study of GMRFs is, obviously, based on the analysis of this 
matrix, which completely characterizes the process. In particular, the 
potential matrix of a homogeneous GMRF is always decomposable as 
$A = A_c + A_{b.c.}$, where $A_c$ is the canonical potential matrix, 
independent of boundary conditions (b.c.), while $A_{b.c.}$ is the border 
potential matrix, i.e., the matrix which contains the interaction parameters 
among internal and external zones. In the case of a first-order homogeneous 
GMRF with Dirichlet b.c. (all the off-lattice zone values are set equal to 
zero), we have $A_{b.c.} = 0$, while $A_c$ may be written as: 

$$ A_c = I_N \otimes B_1 + H_N \otimes C_1 , $$

where: 

$$ H_N = \begin{bmatrix} 0 & 1 & & \\ 1 & 0 & 1 & \\ & \ddots & \ddots & \ddots \\ & & 1 & 0 \end{bmatrix}; \qquad B_1 = \begin{bmatrix} 1 & -\beta_{o1} & & \\ -\beta_{o1} & 1 & -\beta_{o1} & \\ & \ddots & \ddots & \ddots \\ & & -\beta_{o1} & 1 \end{bmatrix} = I_M - \beta_{o1} H_M \qquad (3) $$

and $\beta_{o1}$ is the only horizontal interaction parameter; and again: 
$C_1 = -\beta_{v1} I_M$, where $\beta_{v1}$ represents the only vertical 
interaction parameter. 

For first-order GMRFs, and for a particular class of second-order GMRFs (that 
for which the diagonal interaction parameters are the same), it is possible to 
derive exact analytical expressions for the eigenvalues of A by applying 
Kronecker product rules; these eigenvalues are (Jain, 1979; Balram & Moura, 
1993): 

- first-order fields: 

$$ \lambda_{ij}(A_1) = 1 - \beta_{v1}\lambda_i(H_N) - \beta_{o1}\lambda_j(H_M), \quad i=1,\ldots,N;\ j=1,\ldots,M \qquad (4) $$

- second-order fields (with equal diagonal parameters, i.e. 
$\beta_{dl2} = \beta_{dr2} = \beta_{d2}$): 

$$ \lambda_{ij}(A_2) = 1 - 2\beta_{v2}\cos\frac{i\pi}{N+1} - 2\beta_{o2}\cos\frac{j\pi}{M+1} - 4\beta_{d2}\cos\frac{i\pi}{N+1}\cos\frac{j\pi}{M+1}, \quad i=1,\ldots,N;\ j=1,\ldots,M \qquad (5) $$
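The eigenvalue expressions rest on the known spectrum of the tridiagonal matrix H with ones on the two off-diagonals: its i-th eigenvalue is $2\cos(i\pi/(n+1))$, with eigenvector components $\sin(ik\pi/(n+1))$. The sketch below checks this numerically and evaluates the first-order eigenvalues (4); it uses the symbols beta_o and beta_v for the horizontal and vertical parameters, and the function names are illustrative.

```python
import math

def h_times(v):
    """Multiply the tridiagonal matrix H (ones on the two off-diagonals) by v."""
    n = len(v)
    return [(v[k - 1] if k > 0 else 0.0) + (v[k + 1] if k < n - 1 else 0.0)
            for k in range(n)]

def h_eigenpair(n, i):
    """i-th eigenvalue and eigenvector of the n x n matrix H, i = 1..n."""
    lam = 2.0 * math.cos(i * math.pi / (n + 1))
    vec = [math.sin(i * k * math.pi / (n + 1)) for k in range(1, n + 1)]
    return lam, vec

def first_order_eigenvalues(n_rows, n_cols, beta_o, beta_v):
    """Eigenvalues (4) of the first-order potential matrix, Dirichlet b.c."""
    return [1.0 - beta_v * h_eigenpair(n_rows, i)[0]
            - beta_o * h_eigenpair(n_cols, j)[0]
            for i in range(1, n_rows + 1) for j in range(1, n_cols + 1)]

lam, vec = h_eigenpair(6, 2)
residual = max(abs(a - lam * b) for a, b in zip(h_times(vec), vec))
eigs = first_order_eigenvalues(64, 64, 0.25, 0.15)
```

All eigenvalues must be positive for the potential matrix to be positive definite, so the smallest value in `eigs` indicates how far a given pair of interaction parameters is from the boundary of the valid parameter space.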



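We are now able to give some indications for ML estimation of the parameters. 
The negative log-likelihood (multiplied by the positive constant 1/NM) of a 
first-order, zero-mean GMRF defined on an N×M regular lattice is: 

$$ L(x;\beta,\tau^2) = \frac{1}{2}\ln(\tau^2) - \frac{1}{2NM}\ln|A_1(\beta)| + \frac{1}{2\tau^2 NM}\, x'A_1(\beta)x $$

The quadratic form in L(.), that is $x'A_1(\beta)x$, may be written in a much 
simpler form, recalling the structure of $A_1$ and considering the result (4); 
we obtain: 

$$ L(x;\beta_{o1},\beta_{v1},\tau^2) = \frac{1}{2}\ln(\tau^2) - \frac{1}{2NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\ln\left(1 - \beta_{v1}\lambda_i(H_N) - \beta_{o1}\lambda_j(H_M)\right) + \frac{1}{2\tau^2}\left(S_x - 2\beta_{o1}R_x^{o} - 2\beta_{v1}R_x^{v}\right) \qquad (6) $$

where we put: 

$$ S_x = \frac{1}{NM}\, x'x; \qquad R_x^{o} = \frac{1}{2NM}\, x'(I_N \otimes H_M)x; \qquad R_x^{v} = \frac{1}{2NM}\, x'(H_N \otimes I_M)x $$

The only closed-form estimate is that of the variance, and, substituting the 
expression for $\tau^2$ in L(.), we derive the so-called profile log 
likelihood: 

$$ L(x;\beta_{o1},\beta_{v1}) = \frac{1}{2}\ln\left(S_x - 2\beta_{o1}R_x^{o} - 2\beta_{v1}R_x^{v}\right) - \frac{1}{2NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\ln\left(1 - \beta_{v1}\lambda_i(H_N) - \beta_{o1}\lambda_j(H_M)\right) + \frac{1}{2} \qquad (7) $$

which has to be minimized over the valid parameter space to obtain ML 
estimates. The same lines are followed for estimating the parameters of a 
second-order GMRF: the only difference to remark is the definition of the 
potential matrix, which assumes the form: 

$$ A_2 = I_N \otimes B_2 + H_N \otimes C_2 $$

in which $B_2$ is equal to $B_1$, and so it has the structure indicated in (3), 
while $C_2$ is given by: 

$$ C_2 = -\beta_{v2} I_M - \beta_{d2} H_M $$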

Under the hypothesis that the sample data, $x_{ij}$, are contaminated by 
additive white Gaussian noise, the model becomes $y_{ij} = x_{ij} + \eta_{ij}$, 
where the $x_{ij}$ represent, as already stated, our (first-order) GMRF, while 
the $\eta_{ij}$ are the noise terms, for which the usual hypotheses are valid: 

i) $\eta_{ij} \sim N(0,\sigma^2)$; ii) $E(\eta_{ij}\eta_{kl}) = 0$, $\forall (i,j) \neq (k,l)$; iii) $E(x_{ij}\eta_{kl}) = 0$, $\forall i,j,k,l$ 

It is then possible to write (Moura & Balram, 1992): 

$$ S_y \approx S_x + \sigma^2; \qquad R_y^{o} \approx R_x^{o}; \qquad R_y^{v} \approx R_x^{v} $$

So, if an estimate of the noise variance is available, a direct substitution 
of these terms in (7) will again permit us to obtain ML estimates. 
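The profile criterion (7) and the noise correction can be sketched as follows. The code assumes Dirichlet boundary conditions, so that the eigenvalues of the H matrices are $2\cos(i\pi/(N+1))$ and $2\cos(j\pi/(M+1))$, and it takes the sample statistics $S$, $R^{o}$ and $R^{v}$ as already computed; the function names and the numerical values are illustrative. The minimisation itself is left to any numerical optimiser.

```python
import math

def profile_neg_loglik(beta_o, beta_v, s_x, r_o, r_v, n_rows, n_cols):
    """Profile negative log-likelihood (7) of a first-order GMRF, Dirichlet b.c."""
    tau2_hat = s_x - 2.0 * beta_o * r_o - 2.0 * beta_v * r_v
    if tau2_hat <= 0.0:
        return math.inf
    log_det = 0.0
    for i in range(1, n_rows + 1):
        lam_i = 2.0 * math.cos(i * math.pi / (n_rows + 1))
        for j in range(1, n_cols + 1):
            lam_j = 2.0 * math.cos(j * math.pi / (n_cols + 1))
            eig = 1.0 - beta_v * lam_i - beta_o * lam_j
            if eig <= 0.0:
                return math.inf  # outside the valid parameter space
            log_det += math.log(eig)
    return 0.5 * math.log(tau2_hat) - log_det / (2.0 * n_rows * n_cols) + 0.5

def noise_corrected_stats(s_y, r_o_y, r_v_y, sigma2):
    """Approximate noise-free statistics, given an estimate of the noise variance."""
    return s_y - sigma2, r_o_y, r_v_y

s_x, r_o, r_v = noise_corrected_stats(5.0, 1.0, 1.0, 1.0)
value_at_zero = profile_neg_loglik(0.0, 0.0, s_x, r_o, r_v, 3, 3)
```

Returning infinity outside the valid parameter space is one simple way of keeping an unconstrained optimiser inside the region where the potential matrix is positive definite.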

We now present some experimental results on parameters estimation for first- 
and second-order homogeneous GMRF. We have generated, in each case, 30 
different realizations of the process (lattice dimension, 64x64; Dirichlet 
boundary conditions), using the recursive algorithm which will be discussed 
below; Tables 1 and 2 refer to ML estimation in absence of noise, while Table 3 
presents the results in the case of a first-order process corrupted with additive 
Gaussian noise. 

Table 1: first-order homogeneous GMRF. Parameters: $\beta_{o1}$=0.25; 
$\beta_{v1}$=0.15; $\tau^2$=100; 30 simulated samples. 

                  Mean        St. error
$\beta_{o1}$      0.2431      0.0226
$\beta_{v1}$      0.1550      0.0265
$\tau^2$        100.2280      4.2146

Table 2: second-order homogeneous GMRF. Parameters: $\beta_{o2}$=0.05; 
$\beta_{v2}$=0.05; $\beta_{d2}$=0.15; $\tau^2$=400; 30 simulated samples. 

                  Mean        St. error
$\beta_{o2}$      0.0483      0.0294
$\beta_{v2}$      0.0511      0.0264
$\beta_{d2}$      0.1477      0.0141
$\tau^2$        403.9841     23.5290

Table 3: first-order homogeneous GMRF. Parameters: $\beta_{o1}$=0.25; 
$\beta_{v1}$=0.15; $\tau^2$=100; $\sigma^2$=100; 30 simulated samples. 

                  Mean        St. error
$\beta_{o1}$      0.2370      0.0517
$\beta_{v1}$      0.1584      0.0518
$\tau^2$         99.1988      9.8835


3. Recursive structure of non-causal GMRF 

We will now develop the recursive structure through which it is possible to 
write a GMRF, maintaining its fundamental characteristic of non-causality. The 
starting point is the Minimum Mean Square Error (MMSE) representation (Woods, 
1972): $Ax = \varepsilon$. It is very easy to demonstrate that 
$Ax = \varepsilon$, with $\varepsilon \sim MVN(0, \tau^2 A)$, is equivalent to 
a non-causal autoregressive representation for a GMRF with potential matrix 
equal to A. From this equivalence, it is possible to derive two equivalent 
causal (or unidirectional) representations: the first one is called 
“backward”, the other one “forward”. The fundamental consideration, for this 
purpose, is that A, being positive definite, admits two distinct Cholesky 
decompositions; the first is written as A = U'U, where U is upper triangular, 
while the second decomposition is expressed through the matrix L, which is 
lower triangular: A = L'L. For example, the “backward” representation becomes: 

$$ Ux = w \qquad (8) $$

where $w = (U')^{-1}\varepsilon$, with $E(ww') = \tau^2 I$ and 
$E(xw') = \tau^2 U^{-1}$. 

It is essential to point out that the very regular structure of A is reflected 
in the structure of U. So, as A is block tridiagonal, U has only two block 
diagonals different from 0. Then, U can be written as follows: 



$$U = \begin{bmatrix} U_1 & \Theta_1 & 0 & \cdots & 0 \\ 0 & U_2 & \Theta_2 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & 0 \\ \vdots & & 0 & U_{N-1} & \Theta_{N-1} \\ 0 & \cdots & \cdots & 0 & U_N \end{bmatrix}$$

where $U_i$ and $\Theta_i$ are MxM blocks (as always, we are presenting the case of a
first-order process, but the extension to higher order GMRFs is
straightforward). The unilateral representation of a GMRF leads to a state-space
form for the process; in fact, making use of the structure of the matrix $U$, we
obtain the following formulas:

$x_N = G_N^b w_N$;
$x_i = F_i^b x_{i+1} + G_i^b w_i$ for $1 \le i \le N-1$ (9)

where $G_i^b = U_i^{-1}$ and $F_i^b = -U_i^{-1}\Theta_i$; the blocks
$U_i$ and $\Theta_i$ are computed through a Riccati-type iteration. Let us define
$S_i = U_i'U_i$; then, on the basis of the Cholesky decomposition of the potential matrix
and the block structure of $A$, we obtain the following recursive formulas:

$S_1 = B_1$;
$S_2 = B - C_1' B_1^{-1} C_1$;
$S_i = B - C' S_{i-1}^{-1} C$ for $3 \le i \le N-1$ (10)
$S_N = B_N - C_{N-1}' S_{N-1}^{-1} C_{N-1}$



To obtain the blocks $\Theta_i$, $1 \le i \le N-1$, it is sufficient to observe that the
decomposition of $A$ makes it possible to write:

$U_1'\Theta_1 = C_1$;
$U_i'\Theta_i = C$ for $2 \le i \le N-2$; (11)
$U_{N-1}'\Theta_{N-1} = C_{N-1}$



Through (10)-(11) it is possible to implement a very efficient simulation
algorithm. Further simplifications are obtained by noting that, under general
conditions, the sequence $\{S_i\}$ rapidly converges to a fixed matrix, $S_\infty$.
At the end of this section, we present the neighbourhood system for the
unilateral processes we are discussing [see (8)], and the corresponding
neighbourhood system for the equivalent non-causal models (2), the latter
obtained by making use of the Euclidean distance.
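
The Riccati-type iteration and the backward recursion of this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a homogeneous first-order field in which every row uses the same blocks, with $B$ tridiagonal (1 on the diagonal, $-\beta_h$ off the diagonal) and $C = -\beta_v I$; the function names and the parameter names `beta_h`, `beta_v` are hypothetical.

```python
import numpy as np

def first_order_blocks(M, beta_h, beta_v):
    """Assumed blocks of the potential matrix A for a first-order field."""
    B = np.eye(M) - beta_h * (np.eye(M, k=1) + np.eye(M, k=-1))
    C = -beta_v * np.eye(M)
    return B, C

def riccati_blocks(B, C, N):
    """Blocks U_i, Theta_i with S_i = U_i' U_i and U_i' Theta_i = C."""
    U, Theta = [], []
    S = B.copy()                                  # S_1 = B
    for i in range(N):
        Ui = np.linalg.cholesky(S).T              # upper triangular factor of S_i
        U.append(Ui)
        if i < N - 1:
            Th = np.linalg.solve(Ui.T, C)         # Theta_i = (U_i')^{-1} C
            Theta.append(Th)
            S = B - Th.T @ Th                     # S_{i+1} = B - C' S_i^{-1} C
    return U, Theta

def simulate(U, Theta, tau2, rng):
    """One realization from the backward recursion, with w ~ N(0, tau2 I)."""
    N, M = len(U), U[0].shape[0]
    w = rng.normal(scale=np.sqrt(tau2), size=(N, M))
    x = np.zeros((N, M))
    x[N - 1] = np.linalg.solve(U[N - 1], w[N - 1])             # last row
    for i in range(N - 2, -1, -1):
        x[i] = np.linalg.solve(U[i], w[i] - Theta[i] @ x[i + 1])
    return x

B, C = first_order_blocks(M=5, beta_h=0.25, beta_v=0.15)
U, Theta = riccati_blocks(B, C, N=6)
field = simulate(U, Theta, tau2=100.0, rng=np.random.default_rng(0))
```

Only MxM matrices are ever factorised, which is what makes the scheme efficient; once the sequence of $S_i$ has converged, the same factor could be reused for the remaining rows.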



Figure 1: Neighbourhood system of the zone indicated with 0 for GMRFs of
order 1-4 in (a) unilateral representation, and (b) non-causal representation.


4. A Kalman-type algorithm applied to image analysis

Let us formally define the problem. We have at our disposal an image, defined as
a lattice of NxM regularly spaced pixels, whose values are integers
between 0 and L-1, where L is the number of gray levels (usually,
L=256). Since the image is received from a satellite, it is very likely to be
corrupted by errors of various sorts (observational errors, measurement errors,
and so on). We suppose that the model describing the “true” image
under analysis is a GMRF, $Ax = \varepsilon$; we assume, moreover, that the
observations are altered by Gaussian white noise. The model for the observations,
in other terms, is:

$y = x + \eta$ (12)

where $\eta \sim MVN(0, \sigma^2 I)$. Formulas (9)-(11) lead
directly to the use of the Kalman filter; the algorithm (Rauch, Tung, Striebel,
1965) consists of two passes over the image, to obtain better results: the
first one is the filter pass (forward), the second one is the smoothing pass
(backward).

1st phase (forward)

Initialization: $\hat{x}_{1|0}$, $P_{1|0}$

1) Kalman gain: $K_i = P_{i|i-1}(P_{i|i-1} + \sigma^2 I)^{-1}$

2) Filter update: $\hat{x}_{i|i} = \hat{x}_{i|i-1} + K_i(y_i - \hat{x}_{i|i-1})$

3) Filter covariance matrix update: $P_{i|i} = (I - K_i)P_{i|i-1}$

4) Forecast update: $\hat{x}_{i|i-1} = F_i^f \hat{x}_{i-1|i-1}$

5) Forecast covariance matrix update: $P_{i|i-1} = F_i^f P_{i-1|i-1}(F_i^f)' + \tau^2 G_i^f (G_i^f)'$

2nd phase (backward)

Initialization: $\hat{x}_N = \hat{x}_{N|N}$, $P_N = P_{N|N}$

1) Smoother gain matrix: $Z_i = P_{i|i}(F_{i+1}^f)'P_{i+1|i}^{-1}$

2) Smoother update: $\hat{x}_i = \hat{x}_{i|i} + Z_i(\hat{x}_{i+1} - \hat{x}_{i+1|i})$

3) Smoother covariance matrix update: $P_i = P_{i|i} + Z_i(P_{i+1} - P_{i+1|i})Z_i'$

As an application, we present here the results obtained with the
Kalman filter algorithm on an image of dimension 100x100 pixels. Fig.2(a) is the
original image; Fig.2(b) is the noisy image; Fig.2(c) and 2(d) are, respectively,
the restorations on the basis of a first- and second-order GMRF.

Figure 2: Application of Kalman filtering and smoothing algorithm.

(a) Original image (100x100)

(b) Image corrupted with additive Gaussian noise ($\sigma^2$ = 514, SNR = 8)

(c) Restoration on the basis of a homogeneous first-order GMRF;
estimated parameters: $\hat{\beta}_{o1}$ = 0.288, $\hat{\beta}_{v1}$ = 0.212, $\hat{\tau}^2$ = 368J77

(d) Restoration on the basis of a homogeneous second-order GMRF;
estimated parameters: $\hat{\beta}_{o1}$ = 0.054, $\hat{\beta}_{v1}$ = 0.053, $\hat{\beta}_{d1}$ = 0.197, $\hat{\tau}^2$ = 506.542


As can be seen, although we have used very simple models (at most 3
different spatial interaction parameters), the results are visually satisfactory,
especially in the case of the second-order GMRF. In order to have a statistical
measure of the “goodness of reconstruction”, we calculated the Mean Squared
Error (MSE) between the original image and each of the other ones; the results
are the following: Fig.2(b), MSE=526.086; Fig.2(c), MSE=467.876; Fig.2(d),
MSE=308.420.



Abstract: In this paper the different phases of the treatment of text are sketched, in
order to link them both with some lexical characteristics of the analysed corpus
and with multidimensional techniques useful for the statistical content analysis
of the latter. Our proposal aims to keep intact the system of
meanings present in the corpus and to improve the degree of monosemy of
words. In this way a corpus vocabulary of mixed units of analysis is built.

Keywords: Pre-processing of textual data, Textual variables, Mixed units of
analysis, Disambiguation, Locutions and Polyrhematics.



1. Domain of study and aims 

The principal changes in the evolution of statistical-quantitative studies of
natural languages can be summarised through a series of adjectives: from the
linguistic (Zipf, Waring-Mandelbrot, Yule: see Herdan 1964) to the lexical
(Guiraud, Quemada, Imbs, Brunet: see Muller 1977), from the textual (Lebart
and Salem 1994) to the lexico-textual (Reinert, Labbe, Silberztein, Elia: see
Bolasco et al. 1995). Often in the course of this evolution, theoretical
disputes - for instance, the one concerning the choice of the unit of analysis
(headword or graphic form, words or segments) - have drawn attention away from
the empirical evidence, in this case the criterion of goodness of fit to the
semantics of the text.

Throughout these years, multidimensional methods for the analysis of natural
languages, borrowed from the analysis of numerical data (qualitative and
quantitative), have been proposed, with their corresponding statistical packages,
but no serious attention has been paid to “text care”, nor has software for this very
time-consuming phase been developed. This care should be understood as the
automatic lexico-textual treatment of the linguistic information, aimed at
the content analysis of the corpus under study. In this way, we deal
with the problem of the transformation of text into statistical data. Text has, in
fact, a sequential structure. Therefore, each operation of segmentation into
units of analysis causes a loss of information.

The transformation from linguistic information to textual data requires,
apart from the logic of the study, accurate “pre-processing” work in order
to minimise this loss. This aspect is completely neglected by the statistical
techniques of textual analysis; the latter, indeed, expect the text to be analysed
without any intervention on the starting information (Lebart, Salem 1994).

On the contrary, it is possible to improve the statistical quality of the 
linguistic data and to guarantee, on the one hand, monosemy (intended as 
stability of the meaning of the word in spite of context changes) and, on the 
other, coherence with the content of the text. The latter in particular is satisfied 
by choosing, as elementary units of analysis, mixed forms (for instance lexias or 
textual forms, Bolasco 1997). 



2. Some paradigms in order to perform text segmentation 

If we want to identify the minimal units of meaning in a discourse, we should
consider both single forms and polyforms. With this aim in mind, it is
essential that the steps be carried out in the proper sequence.
Here, some general principles for the segmentation and the disambiguation of
the text are defined, in the form of 5 paradigms.

A) The graphic forms (GF), by means of which a text is presented, are
carriers of the specific contents of the discourse (subjects, times, modality,
intentions). In order to preserve intact the system of meanings present in the
corpus, we need to observe the following.

1- To maintain the graphic form as a textual variable of the first level.

2- To resolve ambiguity, be it semantic or grammatical, in the graphic forms
where the ambiguity is not negligible in terms of occurrences. A grammatical
disambiguation can lead to a semantic disambiguation: the single form
<abito>, if it is a noun, means "an article of clothing"; if it is a verb, it is the first
person singular present indicative of "abitare", and so <abito> means “I live”. A
semantic disambiguation, starting from an analysis of concordance, can lead to
a grammatical one; for example <un dato di fatto> is an idiomatic expression
(meaning “in point of fact”) in which we can clearly identify <dato_noun> and
<fatto_noun>, once their classification as verbs (“given” and “made”) has been
excluded (a classification that remains possible if the analysis is conducted at the
level of single words: datum of fact, or datum of made, or given of fact, or given of
made).

3- The lack of isofrequency between two unlemmatised forms derived from 
the same lexeme is a sign of an imbalance of use, which is a function of 
meanings. In this case, the disambiguation of the most frequent form is the most 
opportune thing to do (Bolasco 1998). 

B) Every discourse is dense with locutions, that is, with fixed-meaning
sequences (idiomatic expressions). These locutions are called polyrhematics,
that is, expressions “whose total meaning is not calculable from the component
lexemes” (De Mauro et al, 1993: 153): their overall sense is
different from the sum of the meanings of their component terms; for example,
<materie prime> (“raw materials”) or <capo dello stato> (“head of state”).

4- The preliminary and automatic recognition of the more common locutions
eliminates at a stroke the ambiguity of the homographic types which make up the
locutions (see for example the aforementioned <dato di fatto>, meaning “in
point of fact”). The availability of frequency dictionaries of polyforms facilitates
the recognition of these locutions (Bolasco, Morrone 1998).

5- The lexicalisation of polyrhematics should be carried out according to the
following criteria:

i) The need to capture the sequences from the longest to the
shortest: <fino in fondo> (“to the furthermost point”) before <in fondo>
(“basically”). If two sequences are of equal length, the one with the highest
number of full words should be lexicalised first: <piu presto possibile> (“at the
earliest time possible”) before <al piu presto> (“as soon as possible”).

ii) The need to recognize the grammatical locutions (adverbial, adjectival,
prepositional and conjunctive) and the nominal polyrhematics at the graphic
form level: <al fine di> (“with the purpose of”), <alla fine di> (“at the end of”). By
contrast, phrasal verbs, given the high number of inflectional forms of a
verb, should be captured at the headword level, otherwise they would not be
easily describable in terms of frequency.

iii) The need to lexicalize sequences not only when they are very common,
but particularly when they strongly absorb the single forms
of which they are composed; for example <lavori agricoli> (“agricultural
works”) has 28 occurrences, against 58 for “lavori” and 30 for “agricoli”. In order to
select the nominal polyrhematics, reference can be made to a particular index
of relevance (Morrone 1993).
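
Criteria i) and iii) can be illustrated with a short sketch. It is not the TALTAC implementation: the function names are hypothetical, only the longest-first ordering of criterion i) is implemented (the tie-break on the number of full words is left out), and the "absorption" ratio is a simple stand-in, not the index of relevance of Morrone (1993).

```python
def lexicalise(tokens, polyforms):
    """Replace known polyforms by single units, trying the longest sequences first."""
    seqs = sorted((tuple(p.split()) for p in polyforms), key=len, reverse=True)
    out, i = [], 0
    while i < len(tokens):
        for seq in seqs:
            if tuple(tokens[i:i + len(seq)]) == seq:
                out.append("_".join(seq))   # the sequence becomes one unit
                i += len(seq)
                break
        else:
            out.append(tokens[i])           # no polyform starts here
            i += 1
    return out

def absorption(sequence_count, component_counts):
    """Share of each component's occurrences that falls inside the sequence."""
    return {form: sequence_count / n for form, n in component_counts.items()}

units = lexicalise("andare fino in fondo e in fondo".split(),
                   ["in fondo", "fino in fondo"])
shares = absorption(28, {"lavori": 58, "agricoli": 30})
```

On the example of the text, the sequence <lavori agricoli> takes up about 48% of the occurrences of "lavori" and about 93% of those of "agricoli", which is the kind of strong absorption that criterion iii) refers to.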



3. An environment for the treatment of textual data 

To implement these paradigms, we propose an integrated package of
tools (hereafter called TALTAC: Italian acronym of Automatic Lexico-Textual
Treatment for Content Analysis) that allows us to check and select the important
linguistic information in order to make it consistent with the statistical analysis.

TALTAC gathers together in a single "environment" several basic
linguistic resources, available from outside the corpus under
study (frequency dictionaries, electronic lexicons) and useful in the
lemmatisation phase, together with a series of tools for "text pre-processing".
TALTAC is independent of any specific statistical package for the analysis of
textual data, but it allows a series of interventions on the corpus which are
directly linked to the different phases of an applied strategy of statistical
analysis.



4. Phases of a strategy of statistical analysis on textual data 

For our purposes, we can break down every strategy of statistical analysis on
textual data oriented to a content analysis into at least some of the following
steps:

a) Definition of the type of corpus and storage of the text: possible identification
of the fragments (propositions, phrases) and “numeration” of the words; possible
matching of the subtexts with categorical variables;

b) Creation of various indexes: dictionary of graphic forms and/or inventory of
repeated segments;

c) Analysis of concordance: study of the local contexts for the purpose of
producing semantic and grammatical fusions/disambiguations;

d) Categorisations of the words with meta-textual information: recoding by means
of equivalencies for grammatical tagging, lexematic reductions, formation of
families of words and reconstruction of thematic fields;

e) Selection of lexical units of a set of forms as object of study: extraction of
the specific language, thresholds of frequency;

f) Partition of the corpus in subtexts: to obtain statistics
on use and specificity or to construct data matrices in order to apply statistical
techniques;

g) Analysis of simple (Lafon 1980) or chronological (Salem 1988) specificity;

h) Application of multidimensional methods of content analysis:
correspondence analysis, discriminant analysis (Lebart 1995), cluster analysis
(Reinert 1995), semantic networks (Scott 1997);

i) Links with external information: positioning of qualitative categorical
variables on the results of the analysis.



5. Phases of the automatic lexico-textual treatment of text 

The TALTAC treatment is characterised by the adoption of a recursive strategy
that does not precede the statistical analysis, but is integrated with
it. The package of textual analysis provides the raw information that, processed
by TALTAC, allows us to make the text more consistent with further statistical
analysis. The TALTAC environment is made up of a number of phases, each
consisting of different steps that interact with the statistical programs of
textual analysis. The flow-chart in fig. 1 reproduces the typical path of text
treatment, which is determined by the combination of some modules of “textual
pre-processing” on the corpus with others of “lexico-textual analysis” on the
linguistic data. This path is only indicative because, even if some modules are to
a certain degree preparatory to others, it is possible to personalise the path, by
repeating the execution of modules already inserted or by adding steps of
analysis according to each individual’s specific demands. Our path is articulated
in four phases, for a total of eight steps, each with one or more modules.

Fig. 1 - Typical Path of an Automatic Lexico-Textual Treatment for Content Analysis

A) Text Normalisation. This first phase is the zero step of any automatic
treatment of text. Its objective is to guarantee transparency and to ensure
exhaustiveness in the automatic reading of the text. This step allows us to minimize
the number of original graphic forms in the first vocabulary of the text and also
to produce corpora that remain comparable across different types of processing.

In textual statistics, units of analysis are defined as successions (chains) of 
characters included between delimiters (Lafon 1984, Lebart, Salem 1994), 
where the valid characters and the delimiters define the reference alphabet. 

The normalisation module implements the transformation from an 
automatically identified succession of characters to the graphic form, as type or 
unlemmatised form (Silberztein 1993), which represents the most elementary 
level of textual data, eliminating possible sources of data splitting. 

This step includes the following points: definition of an alphabet (valid 
characters + delimiters); normalisation of dates, numbers, names, acronyms and 
of accents, apostrophes, spaces, capital letters; association of categorical 
attributes to the different parts of the corpus (fragments and/or subtexts). 
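
A minimal sketch of this step is given below. The delimiter set, the treatment of apostrophes and the blanket lower-casing are assumptions chosen for illustration (a real normalisation module would, for instance, protect proper names and acronyms); the function names are hypothetical.

```python
import re

# Assumed delimiter set; every other character is a valid character of the alphabet.
DELIMITERS = set(" \t\n.,;:!?\"()[]")

def normalise(text):
    """Uniform apostrophes, spacing and capital letters (simplified)."""
    text = text.replace("\u2019", "'")         # typographic apostrophe -> plain one
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of blanks
    return text.lower()

def graphic_forms(text):
    """Chains of valid characters included between delimiters."""
    forms, current = [], []
    for ch in normalise(text):
        if ch in DELIMITERS:
            if current:
                forms.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        forms.append("".join(current))
    return forms

forms = graphic_forms("Abito  a Roma; l\u2019abito è nuovo.")
```

Here the apostrophe is treated as a valid character, so that <l'abito> stays a single graphic form; declaring it a delimiter instead would split it in two, which shows how the choice of alphabet shapes the first vocabulary.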

B) Textual Pre-processing. In this phase the objective is to find, in the corpus
of textual data, the information which is crucial for preparing the best unit
of analysis. This phase is composed of three steps.

1. The first step (Lexico-Statistical Measurement in fig. 1), useful for the
choice of the frequency threshold in the subsequent content analysis,
calculates, on the vocabulary of the normalised corpus, some basic
lexico-statistics such as: rank, range of frequency, coverage of the text, lexical richness
of vocabulary.
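
These basic measures can be computed directly from the list of graphic forms. The sketch below is an illustration only: the function name is hypothetical, and the type/token ratio and the share of hapax are assumed here as simple indicators of lexical richness, which the text does not define.

```python
from collections import Counter

def lexico_stats(forms):
    """Rank, frequency and cumulative coverage of each form, plus richness indicators."""
    freq = Counter(forms)
    n_occurrences = sum(freq.values())     # size of the corpus
    n_forms = len(freq)                    # size of the vocabulary
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    table, cumulated = [], 0
    for rank, (form, f) in enumerate(ranked, start=1):
        cumulated += f
        table.append((rank, form, f, cumulated / n_occurrences))
    hapax = sum(1 for f in freq.values() if f == 1)
    return {
        "N": n_occurrences,
        "V": n_forms,
        "type_token_ratio": n_forms / n_occurrences,
        "hapax_share": hapax / n_forms,
        "table": table,
    }

stats = lexico_stats("a b a c a b".split())
```

Reading the last column of `table` from the top shows how much of the text is covered by the forms down to a given rank, which is the information needed to fix a frequency threshold.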

2. The second step (First Level of Disambiguation in fig. 1) makes a primary
disambiguation of types by means of three distinct modules:

a) The selection module of non-isofrequency situations (Bolasco 1994, 1998),
which allows us to highlight potentially ambiguous words on which an intervention
of disambiguation is necessary for the purposes of analysis.

b) The identification module of the most important basic structures of the
discourse, through the comparison with the fundamental dictionary of
Grammatical Polyforms such as adverbial, adjectival, prepositional and conjunctive
locutions (Bolasco, Morrone 1998b). Their recognition allows a systematic
disambiguation of 25% of occurrences of very usual ambiguous types or words
(homographs).

c) The selection module of the repeated segments, which allows us to identify
automatically the principal Content Polyrhematics or nominal locutions
(Morrone 1995). Due to the operational definition of repeated segment (Lebart,
Salem 1994), the original inventory of segments, selected by the statistical
package, is difficult to consult, since it presents many elements which are not
meaningful for the purpose of the analysis. In order to obviate this problem, the
module identifies, by means of an index of relevance (Morrone 1993), the
grammatically complete segments that are specific expansions of a catalyzer
form.

3. The third step (Lexicalisation in fig. 1) builds the complex units (lexias) by
means of lexicalisation, that is, by composing sequences of words (idiomatic
expressions) to be treated as single forms. This module attends to the recognition
of such structures (grammatical locutions, phrasal verbs and nominal groups),
which are semantically very specific (for instance, polyrhematics).

At the end of this second phase, a first corpus is made available, segmented
into units of analysis of mixed type (single forms and polyforms), ready to
be submitted to statistical analysis.

C) Lexico-Textual Analysis. The third phase (composed of the fourth and
fifth steps) carries out a lexico-textual analysis, for which it is fundamental to interact
with external linguistic resources such as dictionaries of single or complex forms,
local grammars and frequency dictionaries. This phase has the objective of
exploiting these bases of linguistic knowledge in order to select meaningful
textual variables for content analysis.

4. The fourth step consists of the module of Grammatical Tagging (fig. 1),
available for the Italian or the French languages (Morrone 1995), which is a
necessary condition for text lemmatisation. The module calls up some lexicons,
which can be upgraded and also integrated with sectorial languages, and
attributes the grammatical category and the headword to each word. In the case
of homography, the module associates the form with all the pertinent
category/headword pairs. This operation allows us to identify
ambiguous forms, to convert the text to graphic forms with grammatical
categories or to headwords (dictionary address), and to get for each headword the
list of the inflectional forms actually present in the text.
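
The behaviour described above can be sketched with a toy lexicon. Everything in this example is hypothetical (the lexicon entries, the category labels "N" and "V", the "UNK" tag for unknown forms and the function names); it only illustrates how a lookup returns every pertinent category/headword pair and how ambiguous forms and inflectional lists follow from it.

```python
# Hypothetical toy lexicon: graphic form -> list of (category, headword) pairs.
LEXICON = {
    "abito": [("N", "abito"), ("V", "abitare")],
    "abitano": [("V", "abitare")],
    "casa": [("N", "casa")],
}

def tag(forms, lexicon=LEXICON):
    """Attach to each form all the category/headword pairs found in the lexicon."""
    return [(f, lexicon.get(f, [("UNK", f)])) for f in forms]

def ambiguous_forms(tagged):
    """Forms that received more than one category/headword pair (homographs)."""
    return sorted({f for f, pairs in tagged if len(pairs) > 1})

def inflections_by_headword(tagged):
    """For each (headword, category), the inflectional forms present in the text."""
    out = {}
    for f, pairs in tagged:
        for category, headword in pairs:
            out.setdefault((headword, category), set()).add(f)
    return out

tagged = tag(["abito", "casa", "abitano", "xyz"])
homographs = ambiguous_forms(tagged)
inflections = inflections_by_headword(tagged)
```

In this sketch <abito> is reported as ambiguous and is provisionally listed under both headwords, which is why lemmatisation is better applied after the ambiguity has been resolved.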

This step should be performed alongside the statistical analysis, in that the
lemmatised corpus does not always produce a gain in information. In fact,
whatever the method used - Markovian processes (Grigolli et al. 1991) or local
grammars (Silberztein 1993) - automatic lemmatisation is still subject to a
specific error rate above 15%. For the purpose of reducing errors, it is
preferable to lemmatise the corpus after the recognition of polyforms and fixed
phrases. In general, for the goals of content analysis, an almost systematic
lemmatisation of verbs and of adjectives is useful, while the lemmatisation of
nouns is not always appropriate.

5. The fifth step (Thematic Selection in fig. 1) permits the selection of relevant
textual variables. This step can develop in two different perspectives: A)
lexical or B) textual. The first allows us to identify the specific language
with the help of external linguistic resources (language models of reference); the
second, by means of a preliminary factorial analysis on raw graphic forms, allows
us to identify the "key" words for further content analysis.

A- Comparison with reference models. When the corpus is of a large 
dimension and the vocabulary is composed of many thousands of headwords, it 
is necessary to limit "textual variables" only to the typical language (original 
words + Positive/Negative specific words). Frequency dictionaries (VELI, Lip, 
Lif, LE, Tpg, FdP) permit us to select such words, and at the same time to 
indicate intrinsic specificity. 

To select the intrinsic specificity or the typical language it is necessary to
compare the vocabulary of the corpus with the model (frequency dictionary of
reference). This involves the calculation of the index of use and of some measures of
lexical correlation. The index of use is based on the measure of dispersion
(Muller 1977), which presupposes the construction of the contingency table of words
by texts (parts of the corpus), available directly from the statistical package.

The measures of lexical correlation, based on the comparison between the rank of
words in the vocabulary and the corresponding rank in the frequency dictionary,
allow us to delineate the fundamental core of the vocabulary. The technique of
parallel coordinates (Wegman 1990) represents graphically the differences
between ranks and allows us to select, for instance for each part of speech, the
more specific forms. This step is useful for the selection of verbs (Bolasco
1998).
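
The rank comparison can be sketched as follows. This is an assumed, simplified measure (the plain difference between the rank in the reference dictionary and the rank in the corpus), not the specific measures of lexical correlation used by the authors; function names and the toy frequencies are hypothetical.

```python
def ranks(freq):
    """Rank 1 for the most frequent word; ties broken alphabetically."""
    ordered = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: r for r, w in enumerate(ordered, start=1)}

def rank_gaps(corpus_freq, reference_freq):
    """Words common to both lists, sorted by (reference rank - corpus rank).

    A large positive gap marks a word used relatively more in the corpus
    than in the reference dictionary.
    """
    rc, rr = ranks(corpus_freq), ranks(reference_freq)
    common = [w for w in rc if w in rr]
    return sorted(((rr[w] - rc[w], w) for w in common), reverse=True)

gaps = rank_gaps({"lavoro": 10, "essere": 8, "casa": 1},
                 {"essere": 100, "casa": 50, "lavoro": 5})
```

Plotting the two ranks of each word on parallel axes gives the graphical reading mentioned in the text: steep lines correspond to the large gaps at the top and bottom of this list.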

B- Preliminary correspondence analysis on graphic forms. If the comparison 
with a frequency dictionary of reference is not possible (when the corpus is small 
or a pertinent lexicon of reference doesn't exist) we may move on to an 
exploratory analysis of the factorial type on the graphic forms with high 
frequency, to characterise the structural words (key words) on which the 
interventions of second level of disambiguation (semantic meaning) will be 
focused. 

6. The sixth step (Decoding and Recoding of Text in fig. 1) is useful in 
deciding which corrections should be made directly on the text, those which 
have to be made virtually via software and those which should be abandoned 
since they would reduce the quality of the textual data. Beginning from an index 
of the "thematic forms" (typical language or key words) choices of intervention 
are made, usually involving less than 10% of the selected forms. 

With this aim in mind, the analysis of concordance is first carried out
by calling up the integral text in order to obtain a reading of local
contexts. In this way, for instance, it is possible to identify the different
meanings of a single form. Afterwards some hypotheses of both grammatical
and semantic disambiguations or fusions are examined. For some doubtful or
critical cases, it is possible to undertake validation processes with a bootstrap
method on a factorial plane (Balbi 1995). The application of resampling
techniques allows us to build confidence regions (as convex hulls) of single
inflections, in order to verify the impact of the choices which could be made
(Bolasco 1998).
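
The reading of local contexts can be illustrated with a minimal keyword-in-context listing. The function name and the window width are assumptions; the bootstrap validation mentioned above is not covered by this sketch.

```python
def concordance(forms, pivot, width=3):
    """Left context, pivot and right context for every occurrence of the pivot."""
    lines = []
    for i, form in enumerate(forms):
        if form == pivot:
            left = " ".join(forms[max(0, i - width):i])
            right = " ".join(forms[i + 1:i + 1 + width])
            lines.append((left, form, right))
    return lines

text = "io abito a roma e il mio abito è nuovo".split()
contexts = concordance(text, "abito", width=2)
```

The two contexts returned for <abito> ("io ... a roma" and "il mio ... è nuovo") are enough to separate the verb from the noun, which is the kind of decision this step supports.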

On the basis of the information collected during the previous phase, the list of
recodings is compiled; their execution, carried out inside the package of
statistical analysis, does not modify the original text.

D) Content Analysis by Multidimensional Methods. 

1. In this last step (Construction of the Final Vocabulary in fig. 1), the
vocabulary of mixed forms with a frequency higher than the selected threshold for
the content analysis is set. Such units of analysis are forms of lexico-textual type
(headwords, graphic forms, lexias such as locutions and nominal groups) with high
monosemic content. A final treatment balance is also produced in terms of the
disambiguated or lemmatised forms, as well as of the lexicalised
sequences singled out. Finally, the coverage rate is calculated before and after the treatment.
A rate higher than 80% is already obtainable with less than 12% of the graphic
forms of the vocabulary, in correspondence with the rank of the first decile of the
low frequencies. For a corpus of about 50,000 occurrences, the absolute frequency
of such a threshold is not below 10.
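
The coverage rate associated with a frequency threshold can be computed as in the sketch below; the function name is hypothetical and the toy data only illustrate the arithmetic, not the figures quoted in the text.

```python
from collections import Counter

def coverage_at_threshold(forms, threshold):
    """Coverage of the text and share of the vocabulary kept at a frequency threshold."""
    freq = Counter(forms)
    n_occurrences = sum(freq.values())
    kept = {w: f for w, f in freq.items() if f >= threshold}
    coverage = sum(kept.values()) / n_occurrences   # share of occurrences retained
    vocabulary_share = len(kept) / len(freq)        # share of distinct forms retained
    return coverage, vocabulary_share

coverage, vocabulary_share = coverage_at_threshold(["a"] * 6 + ["b"] * 3 + ["c"], 3)
```

Running the function before and after the treatment, with the same threshold, gives the two coverage rates that the final balance compares.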

This vocabulary, in which significant variability has not been reduced and in
which, at the same time, monosemy and the robustness of the variables increase, becomes
the point of departure for the study of the text and for the subsequent
content analysis.



Abstract: In this paper we address the problem of analysing a data set constituted by
multivariate growth curves for different subjects; thus in this context we deal
with 3-way data tables. Nevertheless, it is not possible to use the factorial techniques
proposed for 3-way data matrices, because the observations are
generally not equally spaced; moreover, a multilevel approach founded on
polynomial models is not suitable to deal with intrinsically nonlinear models. We
propose a non-factorial technique to analyse auxological data sets using an
intrinsically nonlinear multivariate growth model with autocorrelated errors. The
application to a real data set of growing children gave easily interpretable
results.

Keywords: Longitudinal studies, multivariate growth models, nonlinear
regression, serial correlation, MLE, three-way data.



1. Introduction

The analysis of data sets constituted by multivariate observations depending on
time for different subjects is a widely studied topic; it depends on many
conditions concerning the kind of data, their quality, the purpose of the analysis,
and so on. In this paper, we are concerned mainly with the analysis of real data
constituted by multivariate growth measures of a set of children, surveyed at
different times. Therefore, at least formally, we have a 3-way data table and so
we could think of using one of the specific techniques proposed to deal with
3-way data matrices, based mainly on different types of factorial decompositions.
Three common methods proposed to deal with 3-way matrices “individuals x
variables x occasions” are:

a) STATIS (Escoufier, 1987), which can be seen as a principal component
analysis where different statistical studies with many variables are compared,
by obtaining a graphical representation where the points are the studies and
the proximity of the points indicates the similarity among the studies;

b) the Tucker3 model (Tucker, 1966), and

c) the PARAFAC (PARAllel FACtor) model (Harshman, 1970).

Both models b) and c) try to decompose the initial 3-way matrix by considering
sets of virtual units, variables and occasions according to a minimisation function
(Rizzi, Vichi, 1995). Many other methods have been proposed, but few means
are available to decide which method is better than the others when we have real
data arranged in a 3-way matrix (Kroonenberg, 1992). Furthermore, these
methods deal with 3-way matrices in which the occasions are always the same
for all subjects, and generally give linear decompositions. Therefore, these
methods are entirely unusable for data sets constituted by individual observations
with different survey times, as in our case. It is also difficult with these methods
to deal properly with a serial correlation structure that could be present in the
individual data.

In the following sections, we first present a data set and describe the multilevel 
models that could be used with growth data. Then, in section 3 we present an 
explicit nonlinear multivariate growth model with autocorrelated errors and in 
section 4 we treat the problem of the estimation of the involved parameters. In 
section 5 we present the results obtained by the analysis of our data set. 

2. Analysis of multivariate growth curves 

A longitudinal data set is constituted by k variables observed on n subjects on
different occasions; in particular, our data set is a sample data set in the
framework of an auxological study in order to assess growth standards; we have
the weight and the height (k=2) of babies (n=64) observed on different occasions,
starting mainly in the first three months after birth and ending at an age
between 3 and 5 years. For the i-th baby and for each variable the relevant
information is the observed growth curve with $m_i$ different occasions $t_{ij}$
(i=1,2,...,n; j=1,2,...,$m_i$). Lags between successive surveys are in general very different
among subjects and within the same subject, so that the $t_{ij}$ are unequally spaced;
also, the number of occasions $m_i$ varies for each subject. Typical growth curves
are reported in figure 1 and figure 2.

Figure 1: Growth curves of height and
Figure 2: Growth curves of height for

Classification approaches founded on cluster analysis techniques (Mineo, 1987;
Chiodi, 1989) are not useful in this context because the data cannot be seen as T
matrices of equal dimensions nxk; indeed sections of the 3-way data set are
possible only along each subject.

Among the useful approaches to the analysis of longitudinal data sets with
different survey times for different subjects and with non-constant time lags for
each subject, multilevel models deserve particular attention; in these models,
variability components of the 1st level (different measurements of one subject)
and of the 2nd level (the different subjects) are considered. Besides multilevel
factorial approaches (Borra, Di Ciaccio, 1996), the 2-level growth model proposed
by Goldstein and others (1994) is of interest:

$$y_{ij} = \sum_{u=0}^{p} \gamma_{iu}\, t_{ij}^{u} + \sum_{v=1}^{q} \alpha_v z_{ijv} + e_{ij} \qquad (1)$$

where $y_{ij}$ is the j-th measurement on the i-th subject, $\gamma_{iu}$ are the
polynomial coefficients for level 1 (the successive measurements), $t_{ij}$ is the
time of the j-th measurement on the i-th subject, $z_{ijv}$ are the covariates,
$\alpha_v$ are the coefficients of the covariates z (level 2) and $e_{ij}$ are the
level 1 random terms, which are usually assumed to be distributed independently
with zero mean and constant variance. These models can therefore be considered an
extension of the polynomial models for growth curves (Rao, 1965).

However, we did not use this model, for two main reasons. First, we are not
interested, at least in this paper, in examining random coefficient models (in
the multilevel model introduced above the $\gamma_{iu}$ coefficients are random
at level 2, with coefficient values varying and covarying between individuals).
Second, the level 1 systematic components, i.e. the time dependence of the
individual measurements, have to be expressly nonlinear in our case, so we cannot
consider a polynomial model, even one of high degree, to obtain individual
parameter estimates with a well defined biological meaning. Another possibility
is to consider a 2-level model with systematic components that are nonlinear in
the parameters but linearised by first-order Taylor approximations (Milani,
Bossi, 1988). In the next section, we analyse the presented data set using an
explicit multivariate nonlinear growth curve model with fixed parameters.

3. Nonlinear multivariate growth model with autocorrelated errors 

The main purpose of the present paper is an exploratory analysis of an
auxological data set, to understand whether the children grow similarly with
respect to the observed variables, and to understand which model can be adopted
to describe the dynamics of growth. Given very short time series with variable
time lags, we found it very hard, or even impossible, to handle these data with
proper dynamic models, so we preferred an approach based on a nonlinear growth
model, which has the advantage of summarising the behaviour of each individual
with a small set of easily interpreted parameters.

For the sake of simplicity, and only to look for simple descriptive quantities
which can summarise such complex data, we tried to fit, for each individual and
for each variable, a general nonlinear growth curve of the family of Von
Bertalanffy curves (Von Bertalanffy, 1957), that is a three-parameter exponential
curve:

$$E(y_t) = \gamma + \alpha\,(1 - e^{-\beta t}) \qquad (2)$$

Of course, the whole of human growth cannot be well described by only 3
parameters (Tanner, 1981): many curves have been proposed for the description
of human growth, even with seven parameters (Jolicoeur, Pernin, and Pontier,
1988); however, our data set concerns only the first years of human life, when
growth speed decreases, and this aspect is satisfactorily described by simple
models. Model (2) has an easy interpretation, since $\gamma$ is the value of y at
birth, $\alpha$ is a scale parameter related to the whole growth and $\beta$
depends on the logarithmic growth speed; the individual fits were generally
better than those obtained with Gompertz or logistic curves.
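As a rough illustration of how curve (2) can be fitted to one individual's
measurements, the sketch below exploits the fact that, for a fixed growth-speed
parameter, the curve is linear in the other two parameters. The function names,
the grid of candidate values and the noise-free data are all hypothetical, and
plain least squares is used here instead of the likelihood with correlated
errors adopted in the paper.

```python
import numpy as np

def bertalanffy(t, gamma, alpha, beta):
    """Three-parameter curve of equation (2): gamma + alpha * (1 - exp(-beta * t))."""
    return gamma + alpha * (1.0 - np.exp(-beta * np.asarray(t, dtype=float)))

def fit_bertalanffy(t, y, betas):
    """Grid search over beta; for each beta, gamma and alpha follow by linear least squares."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for beta in betas:
        # Design matrix of the conditionally linear model: y = gamma * 1 + alpha * (1 - exp(-beta t))
        X = np.column_stack([np.ones_like(t), 1.0 - np.exp(-beta * t)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(np.sum((y - X @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, coef[0], coef[1], beta)
    return best[1], best[2], best[3]

# Hypothetical, unequally spaced ages (months) and noise-free heights (cm).
t_obs = np.array([1.0, 2.5, 4.0, 7.0, 12.0, 20.0, 30.0, 45.0])
y_obs = bertalanffy(t_obs, 50.0, 60.0, 0.05)
gamma_hat, alpha_hat, beta_hat = fit_bertalanffy(t_obs, y_obs, np.linspace(0.01, 0.2, 20))
```

In this toy setting the value at t=0 is gamma and the asymptote is gamma + alpha,
which is the interpretation given above for the parameters of (2).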

The whole model is:

$$y_{ijh} = \gamma_{ih} + \alpha_{ih}\,[1 - \exp(-\beta_{ih} t_{ij})] + \varepsilon_{ijh} \qquad i=1,2,\ldots,n;\ j=1,2,\ldots,m_i;\ h=1,\ldots,k \qquad (3)$$

where $y_{ijh}$ is the value of the h-th variable observed at the j-th occasion
$t_{ij}$ of the i-th individual, $\gamma_{ih}$, $\alpha_{ih}$, $\beta_{ih}$ are
the parameters of the i-th individual and h-th variable, and
$\varepsilon_{ijh}$ is the random error.

A peculiarity of growth curves is the possible presence of serial correlation
between the measurements of an individual (Palmer, Phillips and Smith, 1991);
so we assumed that the random errors $\varepsilon_{ijh}$ are normally distributed
and that the generic random vector $\varepsilon_{ih}$ (constituted by the $m_i$
errors of the h-th variable and the i-th individual) has covariance matrix:

$$E(\varepsilon_{ih}\,\varepsilon_{ih}') = \sigma^2_{ih} R_{ih} \qquad (4)$$

where $\sigma^2_{ih}$ is the common variance and $R_{ih}$ is a correlation matrix
with generic (j,s) element $\rho_{ih,js}$ representing the correlation between
the elements of $\varepsilon_{ih}$ at times $t_{ij}$ and $t_{is}$. Of course, we
need a model for the autocorrelations, in order to employ a limited number of
parameters. Since the times $t_{ij}$ are not equally spaced, we could not employ
ordinary discrete-time ARMA models, so we modelled the autocorrelations
according to an exponential decay (Diggle, 1988):

$$\rho_{ih,js} = E(\varepsilon_{ijh}\,\varepsilon_{ish})/\sigma^2_{ih} = \rho_{ih}^{|t_{ij}-t_{is}|}, \qquad \rho_{ih} \ge 0. \qquad (5)$$
This is the autocorrelation function of a continuous AR(1) process (Jones,
Ackerson, 1990), which allows only non-negative serial correlation. At the first
stage, the autocorrelations $\rho_{ih}$ were assumed to differ for each
individual and each variable. Finally, we assumed that the random errors
$\varepsilon_{ih}$ are not correlated among different individuals and different
variables. Individual correlations among variables are taken into account in the
systematic component of model (3).
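The exponential-decay correlation structure for unequally spaced times can be
sketched as follows. This is only an illustration under the assumption that the
correlation between two errors is the parameter raised to the absolute time lag,
as in a continuous AR(1) process; the function name and the survey times are
hypothetical.

```python
import numpy as np

def car1_correlation(times, rho):
    """Correlation matrix with (j, s) element rho ** |t_j - t_s| (continuous AR(1))."""
    t = np.asarray(times, dtype=float)
    lags = np.abs(t[:, None] - t[None, :])   # matrix of absolute time lags
    return rho ** lags

# Hypothetical, unequally spaced survey times (months) for one child.
times = np.array([0.5, 2.0, 3.0, 7.5, 12.0])
R = car1_correlation(times, 0.4)
```

Because the decay depends only on the lag, measurements taken close together
are strongly correlated and distant ones are almost independent, whatever the
spacing of the surveys.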

4. Estimation of the parameters of the model 

With the assumptions of the previous section, the log-likelihood function
$l_{ih}$ for the $m_i$ data of the i-th individual and the h-th variable is given
by:

$$l_{ih}(\alpha_{ih},\gamma_{ih},\beta_{ih},\rho_{ih},\sigma^2_{ih} \mid y_{ih}) = -\frac{n\log(\sigma^2_{ih})}{2} - \frac{\log(|R_{ih}|)}{2} - \frac{(y_{ih}-f_{ih})' R_{ih}^{-1} (y_{ih}-f_{ih})}{2\sigma^2_{ih}} \qquad (i=1,2,\ldots,n;\ h=1,\ldots,k) \qquad (6)$$

where $y_{ih}$ is the vector of observed data and $f_{ih}$ the vector of fitted
data, depending on the unknown parameters $\alpha_{ih}$, $\gamma_{ih}$,
$\beta_{ih}$ according to the model defined by (3), and $R_{ih}$ is defined
through relations (4) and (5).

In order to estimate the whole set of parameters, we have to maximise the above
quantities; for the sake of brevity we do not report in this paper the explicit
expressions of the inverse and the determinant of $R_{ih}$, since simple
expressions are given by Núñez-Antón and Woodworth (1994); in fact, as in the
usual case of equally spaced times and a discrete-time AR(1) process, the inverse
of $R_{ih}$ is a tridiagonal or Jacobi matrix depending only on $\rho_{ih}$ and
the set of $t_{ij}$, while its determinant is given by a simple factorisation.

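As usual in ML estimation in regression models, we can estimate $\sigma^2_{ih}$
as an explicit function of the other parameters $\alpha_{ih}$, $\gamma_{ih}$,
$\beta_{ih}$, $\rho_{ih}$ and then maximise the likelihood concentrated on the
latter set of parameters. In fact, the MLE $s^2_{ih}$ of the variance component
$\sigma^2_{ih}$ is:

$$s^2_{ih}(\alpha_{ih},\gamma_{ih},\beta_{ih},\rho_{ih}) = (y_{ih}-f_{ih})' R_{ih}^{-1} (y_{ih}-f_{ih})/n, \qquad (7)$$

so that by substitution in $l_{ih}(\cdot)$ we have the concentrated log-likelihood:

$$l_{ih}(\alpha_{ih},\gamma_{ih},\beta_{ih},\rho_{ih},s^2_{ih}(\cdot)) = -\frac{n}{2}\log\!\big((y_{ih}-f_{ih})' R_{ih}^{-1} (y_{ih}-f_{ih})/n\big) - \frac{\log(|R_{ih}|)}{2} - \frac{n}{2} \qquad (8)$$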

which is maximised with respect to $\alpha_{ih}$, $\gamma_{ih}$, $\beta_{ih}$,
$\rho_{ih}$ with ordinary optimisation methods. When we deal with models with
reduced sets of parameters, or more generally with constraints on the parameters,
the overall sample likelihood has to be used: of course the whole log-likelihood
is obtained by adding $l_{ih}(\cdot)$ over all values of i and h, since we
assumed independence of the random errors among individuals and variables.
Specific values of the parameters could be tested by comparing the unconstrained
maximum with the maximum obtained by imposing v constraints on the parameters
and then using the LR (Likelihood Ratio) test; asymptotically $-2\log(LR)$
follows a $\chi^2$ distribution with v degrees of freedom, but unfortunately the
number $m_i$ of observations for each individual is generally too small in our
data set, so that the $\chi^2$ approximation to LR could be used only to give a
rough judgement on the reliability of specific hypotheses.
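The concentration step can be checked numerically with a short sketch. It
assumes the exponential-decay correlation of a continuous AR(1) process and
takes the number of observations of the individual as the sample size in the
likelihood; the helper names, the data and the parameter values are
hypothetical, and the additive constant of the Gaussian likelihood is omitted.

```python
import numpy as np

def _pieces(params, t, y):
    """Return (m, quadratic form, log|R|) for params = (gamma, alpha, beta, rho)."""
    gamma, alpha, beta, rho = params
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    resid = y - (gamma + alpha * (1.0 - np.exp(-beta * t)))   # observed minus fitted
    R = rho ** np.abs(t[:, None] - t[None, :])                 # assumed exponential decay
    _, logdet = np.linalg.slogdet(R)
    quad = float(resid @ np.linalg.solve(R, resid))            # resid' R^{-1} resid
    return y.size, quad, logdet

def sigma2_hat(params, t, y):
    """Explicit ML estimate of the variance for given remaining parameters."""
    m, quad, _ = _pieces(params, t, y)
    return quad / m

def loglik(params, sigma2, t, y):
    """Log-likelihood for a given variance (constant term omitted)."""
    m, quad, logdet = _pieces(params, t, y)
    return -0.5 * m * np.log(sigma2) - 0.5 * logdet - quad / (2.0 * sigma2)

def concentrated_loglik(params, t, y):
    """Log-likelihood with the variance replaced by its explicit ML estimate."""
    m, quad, logdet = _pieces(params, t, y)
    return -0.5 * m * np.log(quad / m) - 0.5 * logdet - 0.5 * m

# Hypothetical heights (cm) of one child at unequally spaced ages (months).
t_obs = np.array([1.0, 2.5, 4.0, 7.0, 12.0, 20.0, 30.0])
y_obs = np.array([53.1, 56.4, 60.2, 66.0, 74.9, 84.8, 93.5])
theta = (50.0, 60.0, 0.05, 0.4)
s2 = sigma2_hat(theta, t_obs, y_obs)
```

A likelihood ratio statistic for v constraints would then be twice the
difference between the unconstrained and the constrained maxima of the summed
concentrated log-likelihoods, obtained with any general-purpose optimiser.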

5. Application to a real data set 

The main aim of the proposed parameterisation for our data set is to reduce it
to a 2-way data set, because the parameters of the systematic part of the model,
$\gamma_{ih}$, $\alpha_{ih}$, $\beta_{ih}$, summarise the third way, i.e. time.
In a first stage we applied the above parameterisation to our data set,
obtaining a set of 3×n×k parameter estimates; in fact we have 3 estimated
parameters for each of the n=64 individuals and for each of the k=2 variables
(height and weight). The analysis of the relationships between the estimates
suggested some reductions in the number of parameters; for the i-th individual
we put:

$\rho_{i1} = \rho_{i2} = \rho_i$ (autocorrelations are equal for the two
variables but generally different among individuals);

$\beta_{i1} = \beta_{i2} = \beta_i$ (equal individual growth speeds for the two
variables but generally different among individuals). A similar simplification
is used by Lundbye-Christensen (1991).

Indeed the last simplification is also strongly suggested by the data, as well as
by the need to use all the available information to estimate individual growth
speeds; in fact height has a 12% average percentage of missing data.
Furthermore, the strong internal (intra-individual) linear correlation between
height and weight suggested this simplification. The decrease in likelihood of
this simplified model was not significant, so we summarised the data set by
means of the estimates of the five parameters of the systematic component:
$\alpha_{i1}$, $\gamma_{i1}$, $\alpha_{i2}$, $\gamma_{i2}$ and the common slope
$\beta_i$. This estimated common individual slope $\hat\beta_i$ turned out to be
highly correlated (R=0.95) with the individual slopes $\hat\beta_{i1}$ and
$\hat\beta_{i2}$ estimated separately for the two variables.

The data did not present any evidence of a difference between male and female
parameters. Two individual parameter estimates appeared to be very far from the
bulk of the data, so we eliminated them from the subsequent stages: they belong
to children for whom the above assumptions lead to unrealistic parameter
estimates, since their observed growth curves are almost linear. In table 1 we
report the mean and standard deviation of the individual estimates of the
parameters, computed on the remaining 62 subjects.
Table 1: Mean and standard deviation of the individual parameter estimates,
computed on 62 subjects

Estimate     $\hat\gamma_{i1}$   $\hat\alpha_{i1}$   $\hat\gamma_{i2}$   $\hat\alpha_{i2}$   $\hat\beta_i$   $\hat\rho_i$
Mean         4.48                16.69               54.58               64.78               0.05            0.42
Std. Dev.    1.19                5.48                4.99                13.97               0.06            0.20



An interesting aspect is the strong non-normality of the joint distribution of the
estimates, as can be seen from figure 3, where we plotted the pairs of values of
the estimates of $\alpha_{i1}$ and $\beta_i$. Some evidence against the joint
normality of the sampling distribution of the estimates, as can be expected
given the intrinsic nonlinearity of the model (Seber, Wild, 1989), also comes
from figure 4, where we report the likelihood contour plot of the 22nd
individual with respect to the same pair of parameters ($\alpha_{i1}$ and
$\beta_i$, with i=22).



Figure 3: Plot of 62 pairs of estimates of $\alpha_{i1}$ (x-axis) and $\beta_i$

Figure 4: Likelihood contour plot of the 22nd individual for the




6. Conclusion 

The above analysis shows that model (3), together with the assumptions made on
serial correlations, is suitable for analysing the growth of children. The data
suggested some reduction in the number of parameters; in particular we
estimated a common individual slope $\hat\beta_i$ and a common individual serial
correlation $\rho_i$ for both variables; even though there is still strong
collinearity among the estimates of the remaining parameters, in the present
paper we do not consider any further reduction of parameters.

A strong non-normality of the sampling distribution of the parameter estimates is
suspected, as usual in intrinsically nonlinear models.

The promising results obtained have led us to deal, in a forthcoming paper, with
random coefficient nonlinear models, in order to study in greater depth the
variability among individual growth parameters.
Abstract: This paper proposes a method for the reconstruction of missing data
in a three-way data array, based on six modified procedures of the optimum
Kalman filter in relation to structural data analysis. The case study concerns
environmental data on sea water pollution observed in the Adriatic Sea.

Key Words: Kalman filter, state-space model, missing data, three way 
environmental data matrix. 



1. Introduction 

The aim of this paper is to propose a methodology for the reconstruction of
missing data in a three-way data array, supposing that the environmental
observations, characterised by three co-ordinates (variables, space and time),
are represented through a state-space dynamic system.

This representation, based on a particular Markovian stochastic process, allows
the change of state from past to future information through the optimum
filtering proposed by R. E. Kalman in system theory (Kalman and Bucy, 1960).
Such a dynamic system is completely specified if it is structured in terms of
two equations: the first, called the transition or state equation, shows the
functional composition of the state parameter, while the second, called the
observation equation, is used to forecast future data or, as in our case, to
reconstruct the missing information. As far as we know, the treatment of missing
data in three-way arrays has not been fully studied, and general and suitable
techniques have not been proposed. The techniques most frequently adopted in
time series analysis use methods linked to ARIMA class models, with stationarity
and isotropy restrictions as well as equally spaced units of measurement of the
variables.



* The paper has been supported by a MURST 40% grant titled “Analisi dei dati
spaziali”, national coordinator Prof. Mauro Coli.







Fig. 1 An example of the three-way data matrix under study.

$Y=\{y(i,j,t);\ i=1,2,\ldots,V$ variables; $j=1,2,\ldots,R$ zones; $t=1,2,\ldots,T$ times$\}$



2. The model 

Any interpretative dynamic model of three-dimensional data should be able to
reproduce, with its own structure, the interrelationships among the variables in
any of the three directions of the data set. In particular, to analyse a
phenomenon characterised by a three-way array $Y=\{y(i,j,t);\ i=1,2,\ldots,V$
variables; $j=1,2,\ldots,R$ zones; $t=1,2,\ldots,T$ times$\}$, the missing part
can be completely reconstructed through a procedure that requires the collapse
of one or two dimensions (fig. 1). Choosing, for example, the i-th variable, the
corresponding slice $\alpha_i(T,R)$, parallel to axes T and R, is a space-time
matrix to which the bidimensional optimum Kalman filter may be applied to
reconstruct the missing data. Similarly, choosing the j-th zone, the
corresponding slice $\beta_j(T,V)$, parallel to axes V and T, identifies a matrix
containing a multiple time series, to which the Kalman filter for vector
autoregressive (VAR) models may be applied.

In this paper we propose six procedures: four are based on a bidimensional
Kalman filter, and two are based on the ordinary Kalman filter, of which the
first uses an ARIMA model and the second a vector autoregressive model.
Moreover, in the procedures whose names include the letter S, a smoothing
algorithm is also applied.

From the above considerations, a flexible and general approach to the
reconstruction of missing data in a three-way array can be carried out by
considering a linear model of two equations expressed in vector form. The
first, the state equation, is $X_{s+1} = \Phi X_s + W_s$, while the second, the
observation equation, is $Y_s = A'\Psi_s + H_s X_s + V_s$, where $\Phi$, $H$ and
$A'$ are parameter matrices; $W_s$ and $V_s$ are the state and observation
Gaussian zero-mean noises, with covariances equal to $Q_w$ and $Q_v$; $\Psi_s$
is a vector of exogenous or predetermined variables. In such terms, the Kalman
filter can be expressed through two phases, forecasting and updating, and
determines the optimal estimate of the state vector $X_s$ whenever new
information becomes available. The optimal estimate of $X_{s+1}$ is given by
$X_{s+1/s} = \Phi X_{s/s}$, while the covariance matrix of the forecast error is
$P_{s+1/s} = \Phi P_{s/s}\Phi' + Q_w$. These two equations are known as the
forecasting equations. Once the new information $Y_s$ becomes available, the
estimate $X_{s/s-1}$ can be updated. The updating equations are
$X_{s/s} = X_{s/s-1} + K_s[Y_s - HX_{s/s-1}]$ and
$P_{s/s} = [I - K_sH_s]P_{s/s-1}$, where $K_s$ is the Kalman gain matrix. It
should be pointed out that, when data are missing, the updating equations
become $X_{s/s} = X_{s/s-1}$ and $P_{s/s} = P_{s/s-1}$.
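The forecasting and updating phases, including the treatment of missing
observations, can be sketched as follows. This is a minimal illustration, not
the procedure used in the paper: the exogenous term of the observation equation
is omitted, the gain is taken in its standard form
$K = PH'(HPH' + Q_v)^{-1}$ (an assumption, since the gain is not written out in
the text), and the function name, the model matrices and the data are
hypothetical.

```python
import numpy as np

def kalman_filter(y, Phi, H, Qw, Qv, x0, P0):
    """Linear Kalman filter; a NaN observation leaves the forecast unchanged."""
    x = np.asarray(x0, dtype=float)
    P = np.asarray(P0, dtype=float)
    I = np.eye(x.size)
    xs, Ps = [], []
    for obs in y:
        # Forecasting equations
        x = Phi @ x
        P = Phi @ P @ Phi.T + Qw
        if not np.any(np.isnan(obs)):
            # Updating equations (standard gain, assumed here)
            K = P @ H.T @ np.linalg.inv(H @ P @ H.T + Qv)
            x = x + K @ (obs - H @ x)
            P = (I - K @ H) @ P
        # With a missing observation the forecast is kept as the estimate
        xs.append(x)
        Ps.append(P)
    return np.array(xs), np.array(Ps)

# Hypothetical local-level model with one missing value in the series.
Phi = np.array([[1.0]])
H = np.array([[1.0]])
Qw = np.array([[0.1]])
Qv = np.array([[0.5]])
y = np.array([[1.0], [1.2], [np.nan], [1.1]])
x_filt, P_filt = kalman_filter(y, Phi, H, Qw, Qv, x0=[0.0], P0=[[1.0]])
y_reconstructed = x_filt @ H.T   # reconstructed series, including the missing time
```

At the missing time the reconstructed value is simply the one-step forecast, and
the forecast error covariance grows until the next observation arrives.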

In some cases the state vector can be interpreted in structural terms, so it is
more appropriate to estimate its value at a particular point using all the
information and not just a part of it. Such an inference is called the smoothed
estimate, while the corresponding estimator is called the smoother. Since the
smoother is based on more information than the filtered estimator, its mean
squared error is generally smaller than that of the filtered estimator. Several
smoothing algorithms for linear models have been proposed in the statistical
literature; in this paper we use the fixed-interval and fixed-lag smoothing
algorithms. The former computes the full set of smoothed estimates for a fixed
span of data and implies a backward recursion of the Kalman filter. The latter
computes smoothed estimates for a fixed delay and runs in parallel with the
Kalman filter (Anderson and Moore, 1979). It is clear that, when we apply the
bidimensional Kalman filter, the smoothing procedure requires considerable
computational effort; in fact, in each iteration of the backward procedure both
the state vector and the covariance matrix of the forecast error must be
stored.

As we can see in fig. 1, the slice $\alpha_i(T,R)$ is a space-time matrix
associated with the i-th variable, to which we can apply the ordinary
unidimensional or the bidimensional Kalman filter. A key feature of
two-dimensional applications is that there is great freedom in deciding which
data are to be considered available for processing. As in the 2-D signal
processing literature, the area of the lattice (slice $\alpha_i(T,R)$) which is
used to process the data is called “the support region”. For example, in the
whole slice $\alpha_i(T,R)$ the smoothing may be performed on the basis of a
support which in principle involves all the data. In practice, considering that
the filtering is optimum only for a minimum dimension of the state vector, only
particular supports are meaningful in the prediction and filtering tasks. To
by-pass the problem of the high dimension of the state vector, the model that we
propose for the reconstruction of the missing data is the Ordinary Reduced
Kalman Filter-Smoothing (ORKFS) (Fig. 2). Next, fixed-lag smoothing is applied.
In this case, the updating procedure is based on a limited amount of information
near $y_{\alpha_i}(j,t)$, the observation on slice $\alpha$ of the i-th variable
at the t-th instant and in the j-th zone. Thus we choose to update only those
elements of the state vector within a fixed distance from $y_{\alpha_i}(j,t)$.
We expect this procedure to give a good approximation because significant
updates will be confined to a region around the observation
$y_{\alpha_i}(j,t)$. Therefore, omitting the update of distant elements should
have only a minimal impact on performance. Since in our case the filter needs a
direction, we assume that the updating starts from the upper left corner,
moving along the observations from left to right and row after row (Woods and
Radewan, 1977) (fig. 2.b).

Fig. 2 Restriction dependence for ORKFS and OMKFS 




(b) (c) 

The support region $R_M$ related to the ORKFS is defined by:

$$R_M(j,t) = \{(j-h,\,t-g) \mid (1 \le h \le M;\ 0 \le g \le M) \cup (-M \le h \le 0;\ 1 \le g \le M)\}$$

where M = 1, 2 defines the order of the recursive model.

As an illustration, Figure 3 shows the support region $R_M$ for a first-order
model (M=1).
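The points belonging to the support region can be enumerated with a few lines
of code. The sketch assumes that the inequalities in the definition are
non-strict (with strict inequalities the first-order region would be empty) and
that each point is written as (zone index, time index); the function name is
hypothetical.

```python
def support_region(j, t, M):
    """Set of lattice points (j - h, t - g) in the support region of order M."""
    region = set()
    # First block: 1 <= h <= M, 0 <= g <= M
    for h in range(1, M + 1):
        for g in range(0, M + 1):
            region.add((j - h, t - g))
    # Second block: -M <= h <= 0, 1 <= g <= M
    for h in range(-M, 1):
        for g in range(1, M + 1):
            region.add((j - h, t - g))
    return region

first_order = support_region(0, 0, 1)
second_order = support_region(5, 5, 2)
```

Under these assumptions a model of order M uses 2M(M+1) neighbouring points, so
the first-order model relies on four of them.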



Fig. 3 Support region $R_M$ for a first-order model




This allows us to define the generic element $y_{\alpha_i}(j,t)$ through the
following state equation:

$$x_{\alpha_i}(j,t) = \sum_{(j-h,\,t-g)\in R_M} \phi_{hg}\, x_{\alpha_i}(j-h,t-g) + w(j,t)$$

where the $\phi_{hg}$ are the coefficients that regulate the relation between
$x_{\alpha_i}(j,t)$ and $x_{\alpha_i}(j-h,t-g)$, and $w(j,t)$ is the realisation
of a Gaussian stochastic field.

The resulting ORKFS equations can be written in scalar form as given below.
In these equations the indices “a” and “b” indicate “after” and “before”
updating, respectively; the argument represents the position of the data on slice
$\alpha_i(T,R)$. State prediction and update are as follows:

$$x^{b}_{\alpha_i}(j,t) = \sum_{(j-h,\,t-g)\in R_M} \phi_{hg}\, x^{a}_{\alpha_i}(j-h,t-g), \qquad (1)$$

$$x^{a}_{\alpha_i}(m,n) = x^{b}_{\alpha_i}(m,n) + k(j-m;\,t-n)\,[y_{\alpha_i}(j,t) - x^{b}_{\alpha_i}(j,t)], \qquad (2)$$

$$\text{with: } (m,n) \in \{m \ge 0,\ n \ge 0\} \cup \{m < 0,\ n > 0\}. \qquad (3)$$

It should be noted that the estimate defined in the last equation can be a
filtered estimate (if j=m, t=n) or a fixed-lag smoothed estimate (if (m,n) occurs
prior to (j,t)). Since the implementation of the ORKFS requires the coupling
coefficient vector $\phi$ and the process noise variance $Q_w$, general and
Bias-Compensated Least Squares are used to identify these quantities directly
from the data (as suggested by H. Kaufman, J. Woods et al., 1983).

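In order to estimate $Q_w$, consider the following expression:

$$P = E[y_{\alpha_i}(j,t) - \phi^T \mathbf{y}_{\alpha_i}(j-1,t)]^2, \qquad (4)$$

where:

$$\mathbf{y}_{\alpha_i}(j-1,t) = \mathbf{x}_{\alpha_i}(j-1,t) + \mathbf{v}_{\alpha_i}(j-1,t), \qquad (5)$$

and $\mathbf{x}_{\alpha_i}(j-1,t)$ is a vector consisting of those elements in
$R_M$.

Expansion of the above expression, taking into account the fact that
$v_{\alpha_i}(j,t)$ is an uncorrelated zero-mean sequence, gives

$$P = E[x_{\alpha_i}(j,t) - \phi^T\mathbf{x}_{\alpha_i}(j-1,t) + v(j,t) - \phi^T\mathbf{v}(j-1,t)]^2 = E[w(j,t) + v(j,t) - \phi^T\mathbf{v}(j-1,t)]^2 = Q_w + Q_v + \phi^T Q_v \phi, \qquad (6)$$

thus

$$Q_w = E[y_{\alpha_i}(j,t) - \phi^T\mathbf{y}_{\alpha_i}(j-1,t)]^2 - Q_v(1 + \phi^T\phi).$$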

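Consequently, this result suggests the following procedure for identifying $Q_w$:

1) Collect a representative set W of data $\{y_{\alpha_i}(j,t) : (j,t) \in W\}$,
and compute an estimate $\hat\phi$ of $\phi$ with the procedure to be discussed
subsequently. Next approximate P with an estimated value $\hat P$, namely:

$$\hat P = \frac{1}{N_W}\sum_{(j,t)\in W} [y_{\alpha_i}(j,t) - \hat\phi^T\mathbf{y}_{\alpha_i}(j-1,t)]^2, \qquad (7)$$

where $N_W$ denotes the number of data in W.

2) Compute an estimate of $Q_w$ as follows:

$$\hat Q_w = \hat P - Q_v(1 + \hat\phi^T\hat\phi). \qquad (8)$$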



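The most straightforward procedure for identifying the coefficient vector $\phi$
is to perform a least-squares fit over a representative block W of data using the
observations $y_{\alpha_i}(j,t)$ in place of the true densities
$x_{\alpha_i}(j,t)$.

That is, the estimate of $\phi$ would be determined so as to minimize

$$\sum_{(j,t)\in W} (y_{\alpha_i}(j,t) - \phi^T\mathbf{y}_{\alpha_i}(j-1,t))^2.$$

Setting to zero the gradient of P with respect to $\phi$ results in:

$$\hat\phi = \Big[\sum_{(j,t)\in W}\mathbf{y}_{\alpha_i}(j-1,t)\,\mathbf{y}_{\alpha_i}^T(j-1,t)\Big]^{-1}\Big[\sum_{(j,t)\in W} y_{\alpha_i}(j,t)\,\mathbf{y}_{\alpha_i}(j-1,t)\Big]. \qquad (9)$$

It should be noted that, since the term that multiplies $\phi$ contains noise,
the estimate will be significantly biased.

In order to reduce this bias, the following correction is suggested:

$$\hat\phi = \Big[\sum_{(j,t)\in W}\mathbf{y}_{\alpha_i}(j-1,t)\,\mathbf{y}_{\alpha_i}^T(j-1,t) - I N_W Q_v\Big]^{-1}\Big[\sum_{(j,t)\in W} y_{\alpha_i}(j,t)\,\mathbf{y}_{\alpha_i}(j-1,t)\Big], \qquad (10)$$

where I is the identity matrix.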
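The effect of the bias correction can be illustrated on simulated data. The
sketch below is only a one-dimensional toy example, assuming a first-order
autoregressive state observed with white noise of known variance; the function
names, the simulated series and all numerical values are hypothetical and are
not taken from the case study.

```python
import numpy as np

def ls_and_bcls(target, support, Qv):
    """Ordinary and bias-compensated least-squares estimates of phi.

    target  : (N,) noisy observations y(j, t)
    support : (N, d) noisy observations on the support region
    Qv      : observation noise variance, assumed known
    """
    N, d = support.shape
    G = support.T @ support          # sum of outer products of the support vectors
    c = support.T @ target           # sum of target times support vector
    phi_ls = np.linalg.solve(G, c)
    phi_bc = np.linalg.solve(G - N * Qv * np.eye(d), c)   # subtract the noise contribution
    return phi_ls, phi_bc

def estimate_Qw(target, support, phi, Qv):
    """Process noise variance from the mean squared residual, corrected for Qv."""
    resid = target - support @ phi
    P_hat = float(np.mean(resid ** 2))
    return P_hat - Qv * (1.0 + float(phi @ phi))

# Simulated toy data: x_s = phi * x_{s-1} + w_s, y_s = x_s + v_s.
rng = np.random.default_rng(0)
phi_true, Qw_true, Qv = 0.8, 1.0, 0.5
N = 20000
x = np.zeros(N + 1)
for s in range(1, N + 1):
    x[s] = phi_true * x[s - 1] + rng.normal(scale=np.sqrt(Qw_true))
y = x + rng.normal(scale=np.sqrt(Qv), size=N + 1)

target, support = y[1:], y[:-1, None]
phi_ls, phi_bc = ls_and_bcls(target, support, Qv)
Qw_hat = estimate_Qw(target, support, phi_bc, Qv)
```

In this setting the ordinary least-squares estimate is attenuated towards zero
by the observation noise, while the compensated estimate and the resulting
process noise variance stay close to the values used in the simulation.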

By changing the dependence restrictions, the Ordinary Reduced Kalman
Filter-Smoothing (ORKFS) procedure (fig. 2.b) can be repeated starting from each
vertex of the slice. The final estimate of the missing data is an average of the
results obtained. This new procedure is a modification of the ORKFS and has been
called the Ordinary Modified Kalman Filter-Smoothing (OMKFS) (fig. 2.b). Keeping
in mind slice $\alpha_i(T,R)$, an alternative procedure called RWKF (Reduced
Weighted Kalman Filter) consists in highlighting a possible recurring data
structure. In this way, the matrices $A'$, $H$ and $\Phi$ can be used to weight
the estimates obtained with the ORKFS procedure by a seasonal index or a
temporal and/or spatial mean. Again, as before, by changing the dependence
restrictions, the procedure can be repeated starting from each corner of the
slice (procedure MRWKF). The last two proposed procedures require a univariate
and a multivariate time series analysis. In particular, on slice
$\alpha_i(T,R)$ we can identify a time series for each zone and use both the
ordinary Kalman filter and fixed-interval smoothing for ARIMA class models
(FKARMAS). On the other hand, on slice $\beta_j(T,V)$ we can filter with a
vector autoregressive (VAR) model and, once again, apply the Kalman filter and
the smoothing algorithm (procedure FKVARS).



4. The case study 

The data of this study come from a monitoring project and concern the presence
of some polluting substances supposed to be responsible for the eutrophication
phenomenon. The three-way array is defined by 10 variables, 17 zones and 49
times. From this array, several observations were eliminated in such a way as to
obtain an internal “cloud” of missing data. This cloud was then reconstructed
using the six adopted procedures, and the goodness of fit was evaluated using
the R squared measure. The following tables show the values and ranks of the R
squared, calculated for our first five procedures and for each of the ten
observed variables. In our case study, looking at both tables, the best method
turns out to be the MRWKF procedure. This confirms that a preliminary study of
the structure of the data is fundamental to obtaining the best reconstruction of
the missing data.



Tab 1. Values of R squared calculated for the first five procedures for each of
the ten variables

Procedures   Variables
             1     2     3     4     5     6     7     8     9     10
ORKFS        0.90  0.95  0.90  0.93  0.96  0.92  0.90  0.96  0.97  0.90
OMKFS        0.95  0.89  0.95  0.92  0.95  0.95  0.91  0.90  0.94  0.86
RWKF         0.91  0.78  0.92  0.91  0.72  0.97  0.97  0.94  0.90  0.92
MRWKF        0.87  0.84  0.91  0.89  0.90  0.99  0.95  0.96  0.98  0.97
FKARMAS      0.92  0.93  0.87  0.85  0.68  0.90  0.86  0.83  0.88  0.84



Tab 2. Ranks of the R squared obtained with each of the five procedures
(the rank entries for the individual variables 1-10 are not legible in the source)

Procedures   Average Ranks   Final rank
ORKFS        2.5             2°
OMKFS        2.6             3°
RWKF         2.9             4°
MRWKF        2.4             1°
FKARMAS      4.2             5°
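As a rough check of how such average ranks can be obtained, the sketch below
ranks the five procedures within each variable using the R squared values of
Tab 1 (rank 1 for the highest value) and averages the ranks by procedure. The
treatment of ties, which here share the average rank, is an assumption, and
since the tabulated R squared values are rounded to two decimals the result need
not coincide exactly with the reported averages.

```python
import numpy as np

# R squared values of Tab 1 (rows: procedures, columns: variables 1..10).
r2 = {
    "ORKFS":   [0.90, 0.95, 0.90, 0.93, 0.96, 0.92, 0.90, 0.96, 0.97, 0.90],
    "OMKFS":   [0.95, 0.89, 0.95, 0.92, 0.95, 0.95, 0.91, 0.90, 0.94, 0.86],
    "RWKF":    [0.91, 0.78, 0.92, 0.91, 0.72, 0.97, 0.97, 0.94, 0.90, 0.92],
    "MRWKF":   [0.87, 0.84, 0.91, 0.89, 0.90, 0.99, 0.95, 0.96, 0.98, 0.97],
    "FKARMAS": [0.92, 0.93, 0.87, 0.85, 0.68, 0.90, 0.86, 0.83, 0.88, 0.84],
}
values = np.array(list(r2.values()))

# For procedure a and variable v: count the procedures with a larger (or equal) value.
greater = (values[None, :, :] > values[:, None, :]).sum(axis=1)
equal = (values[None, :, :] == values[:, None, :]).sum(axis=1)   # includes a itself
ranks = 1 + greater + (equal - 1) / 2      # tied procedures share the average rank
average_rank = dict(zip(r2, ranks.mean(axis=1)))
```

With the rounded values this reproduces the averages reported for OMKFS and
RWKF, while the others differ slightly, presumably because the original ranks
were computed on unrounded R squared values.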



As shown in table 3, even the last procedure (FKVARS), based on VAR models,
shows a satisfactory capacity to reconstruct the missing data of a multiple time
series.

Tab. 3 R squared obtained using the FKVARS procedure

Zones with missing data:  1  7  8  9  10  11  12  13  14
R squared:  0.89  0.91  0.93  0.94  0.79  0.97  0.91  0.86
Abstract: In this paper we present two methods to analyze three-way data arrays
with double neighbourhood relations. The first procedure uses the Kronecker
product of graph matrices to construct a neighbourhood operator. Some of the
most significant eigenvectors of this operator allow modelling of the underlying
phenomena. The second method takes the Kronecker product of the neighbourhood
operators of each graph matrix and is equivalent to a particular STATIS. A
comparison between these two procedures on an ecological data set is then
performed.

Keywords: Graph, Geary coefficient, Kronecker product, Three-way data, STATIS
method, Neighbourhood operator.



1 Introduction 

In the analysis of three-way data arrays, and also of two-way matrices, relations
between observations are unfortunately often ignored. Following the works of
Lebart (1969), Cliff and Ord (1981), Le Foll (1982) and more recently Méot
et al. (1993), proximity between observations can be used in a multivariate data
analysis framework. Extension to three-way data with double neighbourhood can
be done in a natural way (Cornillon et al., 1993). In order to understand what
double neighbourhood is, let us consider some variables collected at different
sites at different times. To take into account the relationship between two sites
(for instance the distance) it is useful to consider a neighbourhood matrix.
Another one is obviously needed to model time relations between observations.
The ecological data set we will study presents a similar structure, as described
in Figure 1. In section 2 we first recall the definitions of neighbourhood matrix
and local variance (Lebart, 1969). From these definitions we introduce the
neighbourhood operator and give a brief review of methods using it, with special
emphasis on Geary’s index (1954) and local variance (Lebart, 1969). This method,
after diagonalization of a particular operator, produces an orthogonal basis
with optimal autocorrelation properties. The extension of this methodology to
the three-way data context leads us to introduce two different methods. Both
approaches take into account the two neighbourhood matrices, through either the
direct Kronecker product of the graph matrices or the Kronecker product of the
neighbourhood operators. The first analysis generalizes the Upton and Fingleton
approach (1985), while the second one is directly connected to the STATIS
method (Lavit, 1988).



2 Some neighbourhood analysis in the two-way case 



To define the neighbourhood relation between p statistical units, we can use a
symmetric valued neighbourhood graph (e.g. Lebart 1969, Méot et al. 1993).
In the sequel we restrict ourselves to unvalued graphs. We can associate with
such a graph a boolean symmetric matrix M of order p, whose general term is
$m_{ij}$ ($1 \le i,j \le p$), such that $m_{ij} = 1$ if vertices i and j are
linked, and zero otherwise. Let x be the vector of the p observations of the
variable x.



Figure 1: Frame of the ecological data set


This contiguity information can be introduced into empirical variance of variable 
X, by a suitable decomposition of ^^(x) in two terms: the first one depends 
on the graph and is called the local variance while the second one depends on 
the complementary graph. Let define a weight matrix of statistical units D = 
diag{di,d2, • • • , dp), the empirical variance can be written as : 

= lEdidj{xi-xj)^ 

hO 

= 2 TUijdidj (^X{ — Xj) 2 '^ij)didj — Xj) . 

id ij 

Putting D* = diag{dl^ ^ 2 ? * * ‘ ? d^), where d* = rriijdj, which is the sum of 
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weights of vertix i’s neighbours, we can write the last expression as: 

5^(x) = x'DQox = x'DEx + x'D(Qo - E)x, 

where $E = D^* - MD$ is called the neighbourhood operator associated with the
matrix M, and $Q_0$ is the projector onto the orthogonal complement of the subspace
spanned by $\mathbf{1} \in \mathbb{R}^p$. The matrix $Q_0 - E$ is also a neighbourhood
operator, that is, a D-symmetric positive matrix A satisfying a condition on
$\sup_{\|x\|_D \le 1} (x \mid Ax)_D$.
Diagonalization of $X'DAXQ$, where Q is the metric of the space of variables, leads
us to consider a Principal Components Analysis (PCA) of proximities and gives
principal components which maximize the following criterion:
$\sup_{\|a\|_Q = 1} (AXQa \mid XQa)_D$ (Méot et al. 1993). This can be interpreted as
a linear combination of the variables of X, namely $XQa$, which has maximum local variance.
If $D = (1/p) I_p$ then the eigenvectors of E maximize the generalized Geary
autocorrelation index. Finally, a Principal Component Analysis with respect to
Instrumental Variables (PCAIV) on the eigenvectors of E can be performed. This
analysis allows us to explain the data matrix by the space spanned by the vectors
with maximum generalized Geary autocorrelation index (see Méot et al. 1993 or
Escoufier, 1987).
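The decomposition of the variance into a local part and a complementary part can be checked numerically. The following is a minimal sketch, not the authors' S-Plus code: the four-unit line graph, the uniform weights and the data vector are hypothetical, and the function names are introduced here only for illustration.

```python
def neighbourhood_operator(M, d):
    """Return E = D* - M D (list of rows) for adjacency matrix M and weights d."""
    p = len(d)
    # d*_i = sum_j m_ij d_j : total weight of the neighbours of vertex i
    d_star = [sum(M[i][j] * d[j] for j in range(p)) for i in range(p)]
    return [[(d_star[i] if i == j else 0.0) - M[i][j] * d[j] for j in range(p)]
            for i in range(p)]

def d_quadratic_form(x, A, d):
    """Compute x' D A x with D = diag(d)."""
    p = len(d)
    return sum(d[i] * x[i] * A[i][j] * x[j] for i in range(p) for j in range(p))

def total_variance(x, d):
    """Empirical variance written as (1/2) sum_ij d_i d_j (x_i - x_j)^2."""
    p = len(d)
    return 0.5 * sum(d[i] * d[j] * (x[i] - x[j]) ** 2
                     for i in range(p) for j in range(p))

# Hypothetical example: 4 units on a line (1-2-3-4) with uniform weights.
M = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
d = [0.25] * 4
x = [1.0, 2.0, 4.0, 8.0]

E = neighbourhood_operator(M, d)
local_variance = d_quadratic_form(x, E, d)

# Complementary graph: entries 1 - m_ij (diagonal terms contribute nothing,
# because (x_i - x_i)^2 = 0).
M_comp = [[1 - M[i][j] for j in range(4)] for i in range(4)]
complementary_part = d_quadratic_form(x, neighbourhood_operator(M_comp, d), d)

print(local_variance, complementary_part, total_variance(x, d))
```

In this toy case the two parts add up to the total variance, which is the content of the identity $s^2(x) = x'DEx + x'D(Q_0 - E)x$.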



3 Extension in the three-way data arrays case 

Let X ($t \times p \times n$) be a three-way data array. We can associate with X two
contiguity matrices $M_s$ and $M_t$, one for each of the first two dimensions. We
denote by $E_s$ and $E_t$ their respective neighbourhood operators. We can vectorize
the two-way matrices ${}_kX$ of X along the third dimension, ${}_kX^v = \mathrm{Vec}({}_kX)$,
and place these vectors, by columns, in the super-matrix $Y = [{}_1X^v \cdots {}_nX^v]$.

Hence we can apply the local framework to the matrix Y of dimension $tp \times n$.
In order to take the temporal and spatial contiguity relations into account in a
variance decomposition, we can consider the Kronecker product of the graphs. This
approach generalizes Méot et al. (1993), using the matrix M of the graph given
by the Kronecker product of $M_s$ and $M_t$: $M = M_s \otimes M_t$. The resulting
operator E verifies:

$$E = D^* - MD = D_s^* \otimes D_t^* - (M_s \otimes M_t)(D_s \otimes D_t)$$
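The factorization of $D^*$ for the product graph can be verified on a small case. This is a sketch under stated assumptions: the two graphs (three stations on a line, two dates) and the uniform weights are hypothetical, and the helper functions are written here only for illustration.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def neighbour_weights(M, d):
    """d*_i = sum_j m_ij d_j."""
    return [sum(m * w for m, w in zip(row, d)) for row in M]

# Hypothetical graphs: 3 stations on a line, 2 consecutive dates.
M_s = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
M_t = [[0, 1], [1, 0]]
d_s = [1.0 / 3] * 3
d_t = [0.5] * 2

# Operator built directly from the product graph M = M_s (x) M_t.
M = kron(M_s, M_t)
d = [a * b for a in d_s for b in d_t]          # diagonal of D_s (x) D_t
d_star = neighbour_weights(M, d)
E_direct = [[(d_star[i] if i == j else 0.0) - M[i][j] * d[j]
             for j in range(len(d))] for i in range(len(d))]

# Same operator from the factorized formula D*_s (x) D*_t - (M_s (x) M_t)(D_s (x) D_t).
D_star = kron(diag(neighbour_weights(M_s, d_s)), diag(neighbour_weights(M_t, d_t)))
MD = matmul(kron(M_s, M_t), kron(diag(d_s), diag(d_t)))
E_formula = [[D_star[i][j] - MD[i][j] for j in range(len(d))] for i in range(len(d))]

max_gap = max(abs(E_direct[i][j] - E_formula[i][j])
              for i in range(len(d)) for j in range(len(d)))
print(max_gap)
```

The two constructions agree because the neighbour weights of the product graph are products of the neighbour weights of the two factors.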

If Y is univariate this approach is similar to the one proposed by Upton and
Fingleton (1985). As mentioned before, we can consider a PCAIV on the most
significant eigenvectors of E, that is, a regression of each variable on the space
spanned by the chosen eigenvectors of E. Hence we try to fit Y with the most
autocorrelated variables in the sense of Geary's index for the given matrix M. As
in principal component regression we only keep $m < n$ axes and the model can
be written as:

$$\hat{Y}_k = \sum_{i=1}^{m} a_{ik}\, C^{(i)}$$

where $C^{(1)}, \ldots, C^{(m)}$ are the first m axes of the PCAIV,

and $\hat{Y}_k$ is the fitted model for the k-th variable.

As a particular case, if we consider two complete graphs ($m_{ij} = 1, \ \forall (i,j)$), this
approach yields a PCA of the centered matrix Y.

We can consider a different approach: the Kronecker product of the operators
$E_s \otimes E_t$:

$$E = D_s^* \otimes D_t^* - D_s^* \otimes M_tD_t - M_sD_s \otimes D_t^* + M_sD_s \otimes M_tD_t$$

By analogy with the local variance, which in the two-way case can be written
as $y'DEy$, we can consider another type of local variance:

$$\gamma_{kk} = {}_kX^{v\prime}\,(D_sE_s \otimes D_tE_t)\,{}_kX^v = \mathrm{tr}\,({}_kX'\,D_sE_s\,{}_kX\,D_tE_t) .$$

This value can be interpreted as a scalar product between the matrices ${}_kX$, or as
the scalar product between columns of Y. Obviously, this scalar product takes the
double neighbourhood structure into account. These remarks lead us to see this
method as a STATIS procedure with particular weight metrics $D_sE_s$ and $D_tE_t$.
Hence the matrix $\Gamma$ of general term $\gamma_{kl}$ represents the variance matrix of the
interstructure stage. The other steps of the STATIS method remain unchanged
(see for instance Lavit, 1988). We can notice that, if we have two complete
graphs and if every ${}_kX$ is centered both by column and by row, then the
coefficient $\gamma_{kl}$ can be written as $\gamma_{kl} = \mathrm{Cov}({}_kX^v, {}_lX^v)$.
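The equality between the Kronecker form and the trace form of $\gamma_{kl}$ can be illustrated as follows. This sketch makes two assumptions that are not stated in the text: each slice is stored with one row per station and one column per date, and it is vectorized row by row. The graphs, weights and data are hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def weighted_operator(M, d):
    """Return D E = D (D* - M D); symmetric whenever M is symmetric."""
    p = len(d)
    d_star = [sum(M[i][j] * d[j] for j in range(p)) for i in range(p)]
    return [[d[i] * ((d_star[i] if i == j else 0.0) - M[i][j] * d[j])
             for j in range(p)] for i in range(p)]

def gamma_kron(Xk, Xl, A_s, A_t):
    """vec(Xk)' (A_s (x) A_t) vec(Xl), with row-by-row vectorization."""
    vk = [v for row in Xk for v in row]
    vl = [v for row in Xl for v in row]
    K = kron(A_s, A_t)
    return sum(vk[a] * K[a][b] * vl[b] for a in range(len(vk)) for b in range(len(vl)))

def gamma_trace(Xk, Xl, A_s, A_t):
    """tr(Xk' A_s Xl A_t)."""
    Xk_T = [list(col) for col in zip(*Xk)]
    P = matmul(matmul(matmul(Xk_T, A_s), Xl), A_t)
    return sum(P[i][i] for i in range(len(P)))

# Hypothetical setting: 3 stations on a line, 2 dates, 2 variables.
A_s = weighted_operator([[0, 1, 0], [1, 0, 1], [0, 1, 0]], [1.0 / 3] * 3)
A_t = weighted_operator([[0, 1], [1, 0]], [0.5, 0.5])
X1 = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
X2 = [[2.0, 0.0], [1.0, 1.0], [0.0, 4.0]]

g_kron = gamma_kron(X1, X2, A_s, A_t)
g_trace = gamma_trace(X1, X2, A_s, A_t)
print(g_kron, g_trace)
```

The trace form avoids building the $tp \times tp$ Kronecker matrix, which matters when t and p are as large as in the application of the next section.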



4 Adriatic sea pollution 

In order to analyze the Adriatic sea pollution we have considered a set of 10
physico-chemical variables (Temperature [Temp], Salinity [Sal], Transparency
[Tran], Chlorophyll [ClorA], pH [PH], Ammonia [Ammo], Nitric Nitrogen [ANITRC],
Nitrous Nitrogen [ANITRO], Ortho Phosphorus [FosOrt], Global Phosphorus
[FosTot]) sampled at 17 different stations along the Abruzzo coast on 49
different dates. The dimensions of the three-way data set were t = 49, p = 17, n = 10.
Moreover this data set was sampled at two different distances from the coast
(500 and 3000 meters). To take the temporal and spatial contiguity relations
into account we used two different neighbourhood graph matrices. Every sampling
time is placed along a line and is linked to another one if the segment of the
line between them does not include another sampled time point. A more complex
graph is constructed for the spatial contiguity: we consider the mutual distances
between stations; two stations are connected if their distance is less than the
greatest distance between two consecutive stations. A more graphical explanation is
in Figure 2. We have carried out two different analyses, each at both distances
from the coast: the Kronecker product of the graphs and the Kronecker product
of the operators. For the first approach, starting from the contiguity matrices
we compute the Kronecker product between them, obtaining the neighbourhood
operator E. We diagonalize this operator and the associated eigenvectors are
used for a PCAIV with respect to these eigenvectors. That is, we try to explain
the variability of Y by the most autocorrelated variables in the sense of Geary's
index. In order to choose the most significant eigenvectors we have performed
a simple regression of the 10 variables on each eigenvector.
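The construction of the two contiguity matrices described above can be sketched as follows. The station coordinates are hypothetical, and the comparison uses "less than or equal to" rather than a strict inequality, an assumption made here so that the two most distant consecutive stations remain linked.

```python
import math

def chain_graph(n):
    """Temporal graph: each date is linked only to the previous and next one."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

def threshold_graph(coords):
    """Spatial graph for stations listed in their order along the coast.

    Two stations are linked when their distance does not exceed the greatest
    distance between two consecutive stations (assumption: <= instead of <).
    """
    n = len(coords)
    dist = [[math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
             for j in range(n)] for i in range(n)]
    threshold = max(dist[i][i + 1] for i in range(n - 1))
    return [[1 if i != j and dist[i][j] <= threshold else 0 for j in range(n)]
            for i in range(n)]

# Hypothetical coordinates of 5 stations (arbitrary units).
stations = [(0.0, 0.0), (1.0, 0.0), (1.5, 1.0), (4.0, 1.0), (4.5, 0.0)]
M_t = chain_graph(4)
M_s = threshold_graph(stations)
for row in M_s:
    print(row)
```

With these coordinates the spatial graph contains the chain of consecutive stations plus one extra link between the first and third stations, which are closer to each other than the two most distant consecutive stations.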



Figure 2: Neighbourhood matrix construction

Figure 4 displays the plot of the variables FosOrt and Temp at 3000 m, and the
model fitted with the first component of the PCAIV (m = 1). As we can see, on
the one hand the selected model explains the variation of FosOrt well, but on
the other hand some of the variability of the temperature (Temp) is not captured.
The one-dimensional model achieves a good explanation of the global variation
and allows unusual behaviour to be detected.



Figure 4: Data points and fitted data using one-dimensional model (solid lines) 




The second approach (Kronecker product of the operators) leads us to a STATIS
analysis of the arrays ${}_kX$ with semi-metrics $D_sE_s$ and $D_tE_t$. Obviously the
interstructure step is equivalent to a PCA of the statistical triplet (Y, A, N), where
$A = D_sE_s \otimes D_tE_t$ is a semi-metric for $\mathbb{R}^{tp}$ and N defines a metric
in $\mathbb{R}^n$ (see Lavit, 1988).



Figure 5: Interstructure with and without double neighbourhood.

The plot of the n = 10 variables from the interstructure step allows us to detect a
common structure between variables at different distances and different times. We
can notice in Figure 5 that the clusters of variables at distance 500 m are far more
distinct when we use the neighbourhood constraints. This effect disappears when we
consider the distance 3000 m. This is the direct consequence of a main ecological
difference between 500 and 3000 m: at 3000 m from the coast the water is open
sea and the dilution of the pollution is far more efficient than at 500 m.

A “mean” (called the compromise) of all the n arrays is computed at the compromise
step. From this array we can extract principal components and axes which form a
new basis for their respective spaces (i.e. $\mathbb{R}^t$ and $\mathbb{R}^p$). Using this
basis, a plot of each array (one per variable) can be drawn; these plots constitute the
intrastructure step. For instance we can notice that for Nitric Nitrogen (Figure 6) the
stations have different correlations between them at 500 m and at 3000 m. A finer
interpretation shows that stations 1 and 8 both have a good projection in
the 500 m and 3000 m cases. This suggests that they have a global behaviour (in the
time sense) close to the “mean” (i.e. compromise) behaviour for this variable. Of course
similar plots can be drawn for the other variables.




5 Concluding remarks 

In order to analyze the Adriatic sea pollution we have developed two different
analyses: the Kronecker product of the graphs and the Kronecker product of the
operators. We have noticed that for the first method difficulties arise from the
choice of components used in the model: obviously different components yield
different results. A better criterion than the so-called “periodograph” is needed
to improve the ability of this method to fit the data set. The second approach
seems to be a good exploratory tool in the double neighbourhood framework: it
gives better results than standard STATIS and, moreover, is easy to implement on
a computer. This study was carried out with standard S-Plus software, and the
functions are available from the second author.

Abstract: This paper deals with an “ecumenical” approach to the analysis of
productive processes. In particular we suggest a research strategy based on five
known estimation methods for the production frontier function, each one
characterised by its own type of flexibility in the evaluation of results. In this way
we obtain, through a bootstrap methodology, an interval estimate of the
efficiency score of any firm.

Key words: Productive performance, Efficiency, Frontier Function, Bootstrap,
Panel Data.



1. Introduction 

In production analysis, performance measures have the task of measuring the
“capacity” of an economic unit to transform input into output and of indicating the
degree to which a given production standard, considered as the optimal one, is achieved.
Such a standard is normally expressed by a function representing either a
production or a cost function.

Three different approaches are generally used for frontier estimation in the
international econometric and statistical literature.

• Deterministic approach¹: all the observations are supposed to lie below the
frontier and the (one-sided) error term captures inefficiency only;

• Stochastic approach: observations are affected by an error term made up of two
independent components, one representing the statistical noise and the other
inefficiency (Aigner, Lovell and Schmidt, 1977);

• Non parametric approach: no distributional hypothesis is advanced for the
frontier shape. In some applications only frontier convexity is assumed. By
contrast, the Free Disposal Hull (FDH) technique, which we consider in this paper,
ignores convexity as well.



¹ In the sense of Cornwell and Schmidt (1996): “...technical efficiency for each firm can be
calculated essentially as in the COLS procedure for the deterministic frontier”.

This paper deals with an “ecumenical” approach to the analysis of productive
processes. In particular we suggest a research strategy based on five known
estimation methods for the production frontier function, each one characterised
by its own type of flexibility in the evaluation of results.

This is particularly relevant when a performance analysis of several firms is
conducted using panel data; in this case we have to deal with heterogeneity of the
productive processes. Using methods characterised by different degrees of flexibility in
terms of efficiency, it is possible to control for such heterogeneous behaviour,
avoiding heavy estimation biases.



2. Frontier Estimation with Panel Data 

Consider a production function

$$Y_{it} = \alpha + x_{it}\beta + \varepsilon_{it}, \qquad (1)$$

where $Y_{it}$ indicates the appropriate function of production for the i-th sample firm
(i = 1, 2, ..., N) in the t-th period (t = 1, 2, ..., T); $x_{it}$ is a ($1 \times k$) vector of
appropriate functions of the inputs associated with the i-th sample firm considered at
time t (the first element should be one); $\beta$ is the ($k \times 1$) vector of the regression
coefficients and $\alpha$ is a constant.

Two different forms can be used to represent the error term: (i) in the
deterministic approach only the simple condition $\varepsilon_{it} \le 0$ is used, since it
represents inefficiency only; (ii) in the stochastic approach, proposed independently by
Aigner, Lovell and Schmidt (1977) and Meeusen and Van Den Broeck (1977), the error
term is divided into two components, $\varepsilon_{it} = V_{it} - U_i$; the $V_{it}$ are assumed
to be $N(0, \sigma_V^2)$ independent and identically distributed (IID) random variables,
independent of the $U_i$ random variables; the $U_i$ are also IID, non negative and
defined by the truncation (at zero) of the $N(\mu, \sigma_U^2)$ distribution. In addition, it
is assumed that the $V_{it}$ and $U_i$ random variables are independent of the input
variables in the model.

Coming back to the deterministic approach, if we have more than one
observation for each unit and we suppose that each unit has its own characteristics,
different from those of the other units, it is possible to add a t suffix to every
observation and to introduce an individual effect ($\alpha_i$), constant in time but
different for each unit. This gives the following model

$$Y_{it} = \alpha_i + x_{it}\beta + V_{it}, \qquad (2)$$

where $\alpha_i = \alpha - U_i$. As is well known, the efficiency score (Technical Efficiency
Degree, TED) of every unit (firm) under the hypothesis of time constancy can be
obtained by two estimation strategies referred to model (2): i) the within estimator
and ii) least squares dummy variables, LSDV (Baltagi, 1995).

Among parametric deterministic methods, we mention the proposal of Aigner and Chu
(1968). Their model can be written as (1) and the vector of parameters can be
estimated via linear or quadratic programming. In other words, the sum of the
absolute values of the residuals is minimised under the constraint that every
residual is non positive.

TED can be directly estimated from the residual vector. With panel data it is
possible to obtain a global efficiency score as an average of annual TEDs.

For the stochastic approach, still assuming a time constant TED, we refer to
the classic Battese and Coelli (1988) method.

They consider an error term divided into two components, $\varepsilon_{it} = V_{it} - U_i$:
the $V_{it}$ are assumed to be $N(0, \sigma_V^2)$ independent and identically distributed
(IID) random variables, independent of the $U_i$ random variables; the $U_i$ are also IID,
non negative and defined by the truncation (at zero) of the $N(\mu, \sigma_U^2)$ distribution.

Given all the distributional assumptions, the joint density function of the error
vector can be derived, and it is possible to obtain the
maximum likelihood estimator (MLE) of the parameters ($\alpha$, $\beta$ and the parameters
of the distributions of V and U).

Finally TEDs are calculated as $\hat{U}_i = E(U_i \mid V_{it} - U_i)$, where
$V_{it} - U_i$ are the residuals $Y_{it} - \hat\alpha - x_{it}\hat\beta$.

Non parametric analysis is conducted with the Free Disposal Hull approach (FDH;
Deprins, Simar and Tulkens, 1984), modified by Tulkens (1993) for a panel
framework.

The last estimation method considered is the so-called semi-parametric
approach proposed by Viviani (1996); it is a two-stage estimation procedure: in the
first step the undominated (efficient) firms are determined via the FDH filter. Then,
from the FDH data set just identified, it is possible to estimate a functional
specification using OLS. If the FDH filter has satisfactory results, the error term,
originally divided into two components $\varepsilon_{it} = V_{it} - U_i$, loses $U_i$. Only the
$V_{it}$ remain, and they are $N(0, \sigma_V^2)$.
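The first stage of the two-stage procedure, the FDH filter, can be sketched as follows. The four firms and their data are hypothetical. The efficiency score shown is the usual input-oriented FDH measure, chosen here as an assumption since the text does not state the orientation used.

```python
def dominates(a, b):
    """True if firm a uses no more of every input and produces no less output
    than firm b, with at least one strict inequality."""
    (xa, ya), (xb, yb) = a, b
    no_worse = all(u <= v for u, v in zip(xa, xb)) and ya >= yb
    strictly = any(u < v for u, v in zip(xa, xb)) or ya > yb
    return no_worse and strictly

def undominated(firms):
    """FDH filter: names of the firms that no other firm dominates."""
    return [name for name, f in firms.items()
            if not any(dominates(g, f) for other, g in firms.items() if other != name)]

def fdh_input_score(name, firms):
    """Input-oriented FDH score: the smallest proportional input contraction
    reachable by an observed firm producing at least as much output."""
    x_i, y_i = firms[name]
    return min(max(xj / xi for xj, xi in zip(x_j, x_i))
               for x_j, y_j in firms.values() if y_j >= y_i)

# Hypothetical firms: ((labour, capital), output).
firms = {
    "A": ((2.0, 3.0), 10.0),
    "B": ((4.0, 6.0), 10.0),
    "C": ((3.0, 2.0), 12.0),
    "D": ((5.0, 5.0), 20.0),
}
efficient = undominated(firms)
scores = {name: fdh_input_score(name, firms) for name in firms}
print(efficient, scores)
```

In the semi-parametric approach only the firms returned by the filter would then enter the OLS regression of the second stage.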



3. A Bootstrap Methodology for Constructing Confidence 
Intervals for Estimated Efficiency Scores 

Define a set of panel data efficiency scores obtained from several estimation
methods:

$$TED = \{ted_{im} : i = 1, \ldots, N;\ m = 1, \ldots, M\} \qquad (3)$$

where i indicates the sample firm and m the estimation procedure used to calculate TED: in
this case M = 5, with 1 = within estimator (wit), 2 = stochastic approach (sto), 3 = non
parametric approach (npa), 4 = parametric-deterministic approach (pda) and 5 = semi-
parametric approach (spa).

In the discussion that follows, we consider the problem of computing a
confidence interval for a set of firm means
$\overline{ted}_{i\cdot} = \frac{1}{M}\sum_{m=1}^{M} ted_{im}$, $\forall i = 1, \ldots, N$.

Assume that for each i = 1, ..., N the TEDs represent a set of IID random variables
$\{ted_{im}\}_{m=1}^{M}$ with constant mean $\mu_{ted_i} = E(ted_{im})$ and constant finite
variance $\sigma^2_{ted_i}$.² The Lindeberg-Lévy theorem (LLT; Atkinson and Wilson, 1995)
indicates that the sample firm mean $\overline{ted}_{i\cdot}$ is asymptotically normally
distributed with mean $\mu_{ted_i}$ and variance $\sigma^2_{ted_i}/M$, regardless of the
distribution of the $ted_{im}$.

An unbiased, consistent estimator of $\sigma^2_{ted_i}$ is given by
$\hat\sigma^2_{ted_i} = \frac{1}{M-1}\sum_{m=1}^{M} (ted_{im} - \overline{ted}_{i\cdot})^2$.
Thus, the variance of the sample mean $\overline{ted}_{i\cdot}$ can be consistently
estimated by $\hat\sigma^2_{ted_i}/M$.

Given large M and the other assumptions of the LLT,
$z_i = (\overline{ted}_{i\cdot} - \mu_{ted_i}) / (\hat\sigma_{ted_i}/\sqrt{M})$
is asymptotically standard normal. Unfortunately, in our case the sample of TEDs for
each firm, $\{ted_{im}\}_{m=1}^{M}$, is small, so that the LLT does not guarantee asymptotic
normality for $\overline{ted}_{i\cdot}$. Fortunately, when we have small samples that are not
normally distributed, we can use the bootstrap to obtain approximate confidence intervals
for $\mu_{ted_i}$. For a given i = 1, ..., N we have a random sample $\{ted_{im}\}_{m=1}^{M}$.
The following steps lead to a bootstrap estimate of the confidence interval for
$\mu_{ted_i}$ (Atkinson and Wilson, 1995):

1. Compute the sample mean $\overline{ted}_{i\cdot} = \frac{1}{M}\sum_{m=1}^{M} ted_{im}$, $\forall i = 1, \ldots, N$;

2. Compute $\widetilde{ted}_{im} = ted_{im}\sqrt{M/(M-1)} + \overline{ted}_{i\cdot}\left(1 - \sqrt{M/(M-1)}\right)$;

3. Independently draw M times from the set $\{\widetilde{ted}_{im}\}_{m=1}^{M}$ with replacement, such that
each observation has equal probability of selection, to obtain $\{ted^*_{im}\}_{m=1}^{M}$;

4. Compute $\overline{ted}_{i\cdot}^{\,*}(j) = \frac{1}{M}\sum_{m=1}^{M} ted^*_{im}$;

5. Repeat steps (3)-(4) J times to obtain $\{\overline{ted}_{i\cdot}^{\,*}(j)\}_{j=1}^{J}$, where J is
appropriately large in magnitude.

² The same hypothesis is made by Atkinson and Wilson (1995) to compute a confidence interval
for a set of annual firm TED means.

The correction in step 2 is necessary, however, to avoid type-I errors in small
samples, as shown by Atkinson and Wilson (1995).

The bootstrap values $\{\overline{ted}_{i\cdot}^{\,*}(j)\}_{j=1}^{J}$ approximate the exact
small-sample distribution of $\overline{ted}_{i\cdot}$. Thus, these values can be sorted by
algebraic value to construct confidence intervals for $\mu_{ted_i}$ via the bootstrap
percentile method described by Efron (1982). Letting $\overline{ted}_{i\cdot}^{\,*(\alpha)}$
denote the $(100 \times \alpha)$th percentile of the J bootstrap replications, the percentile
method gives the bounds
$[\overline{ted}_{i\cdot}^{\,*(\alpha)},\ \overline{ted}_{i\cdot}^{\,*(1-\alpha)}]$ for the
$[(1 - 2\alpha) \times 100]$ percent confidence interval for $\mu_{ted_i}$. In other words, the
$[(1 - 2\alpha) \times 100]$ percent confidence interval is obtained by deleting $\alpha J$
values from both ends of the sorted array of J bootstrapped values and taking the
endpoints of the newly truncated, sorted array as the boundaries of the confidence
interval (Atkinson and Wilson, 1995).
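Steps 1 to 5 and the percentile rule can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the function name, the number of replications and the seed are assumptions, and the five scores are those reported for firm 10 in Tab. 4, so the interval produced here need not coincide with the published one.

```python
import random
from math import sqrt

def bootstrap_ci(scores, alpha=0.025, J=2000, seed=0):
    """Percentile bootstrap interval for the mean of a small sample of TEDs.

    alpha = 0.025 gives a (1 - 2*alpha) * 100 = 95 percent interval.
    J and seed are illustrative choices, not values taken from the paper.
    """
    rng = random.Random(seed)
    M = len(scores)
    mean = sum(scores) / M                                  # step 1
    c = sqrt(M / (M - 1))
    adjusted = [s * c + mean * (1 - c) for s in scores]     # step 2: small-sample correction
    boot_means = sorted(sum(rng.choices(adjusted, k=M)) / M  # steps 3-5
                        for _ in range(J))
    cut = int(alpha * J)                                    # delete alpha*J values at each end
    trimmed = boot_means[cut:J - cut]
    return mean, trimmed[0], trimmed[-1]

# Scores of firm 10 for the five estimation methods, as listed in Tab. 4.
firm_scores = [0.15, 0.20, 0.16, 0.22, 0.49]
mean, lower, upper = bootstrap_ci(firm_scores)
print(round(mean, 3), round(lower, 3), round(upper, 3))
```

The correction of step 2 leaves the sample mean unchanged and only inflates the spread of the resampled values, which widens the interval slightly for small M.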



4. Empirical Analyses 

These theoretical results have been applied to a sample of firms operating in
Tuscany in 1993/94 (Lemmi et al., 1996), mostly widespread and of considerable
innovative interest.

Such firms have been selected from official databases (called in Italy CERVED 
and held by Chambers of Commerce). They belong to the sector of manufacturing 
firms (textile, food, wood, mechanics, steel, chemical, etc.) and they have more 
than 6 employees to avoid fragmentation and recording problems. 

The model variables (Giusti 1994; Griliches and Ringstad 1971) are:

• Total Production Value (Y); 

• Labour Hours (L); 

• Capital Value (K); 

• Material Input (MP) 

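The production function (Cobb-Douglas type) is represented by:

$$\ln Y_{it} = \alpha_i + \beta_1 \ln L_{it} + \beta_2 \ln MP_{it} + \beta_3 \ln K_{it} + V_{it} \quad \text{[deterministic approach]} \qquad (4)$$

$$\ln Y_{it} = \alpha + \beta_1 \ln L_{it} + \beta_2 \ln MP_{it} + \beta_3 \ln K_{it} + V_{it} - U_i \quad \text{[stochastic approach]} \qquad (5)$$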



Estimation results are contained in Tables 1, 2 and 3:

Table 1: Parameters of production function.

             WITHIN ESTIMATOR                      ML ESTIMATES (Battese and Coelli)
             (R² = 0,996)                          (Log-Likelihood = 286,803)
Variable     Coefficient  St.Error  t-Statistic    Coefficient  St.Error  t-Statistic
Intercept    -            -         -              0,950        0,419     2,267
LnL          0,539        0,103     5,225          0,480        0,095     5,031
LnMP         0,240        0,041     5,854          0,242        0,048     5,042
LnK          0,191        0,048     3,978          0,220        0,029     7,586
μ/σ_U                                              2,141        0,004     535,250
σ²_U/σ²_V                                          23,582       4,506     5,233
σ²_V                                               3,13E-02     0,003     10,433

Table 2: Two-stage estimation procedure (Viviani, 1996)

             OLS ON UNDOMINATED FDH '93            OLS ON UNDOMINATED FDH '94
             (R² = 0,963)                          (R² = 0,935)
Variable     Coefficient  St.Error  t-Statistic    Coefficient  St.Error  t-Statistic
Intercept    2,131        0,411     5,176          2,094        0,559     3,745
LnL          0,202        0,049     4,055          0,261        0,066     3,901
LnMP         0,414        0,040     10,320         0,321        0,046     6,839
LnK          0,294        0,048     6,074          0,338        0,064     5,249

Table 3: Parametric and deterministic approach (Aigner and Chu, 1968)

             Intercept    LnL       LnMP      LnK
YEAR 1993    2,210        0,316     0,297     0,378
YEAR 1994    4,178        0,205     0,383     0,130

Note: Estimates on a sample of 99 firms which provided their own budget

Several authors consider FDH as the method leading to results very close to a
possible measure of competitiveness in the market. Therefore with FDH it is possible
to compare market shares and thus product power. The results of the non parametric
procedure can give an indication of the competitiveness of the firms considered; such
an indication is usually combined with a measure of purely technical efficiency.

Tab. 4 shows some emblematic results. For reasons of privacy, individual data and
complete results (available from the authors on request) are not shown.

Fig. 1 contains the TED averages and the confidence intervals for the firms in
Tab. 4.

Interpretation of the figure is immediate. All the firms with a high efficiency
score and a narrow confidence interval show substantial agreement among the five
procedures. In this case the firm is not only efficient from a technical point
of view but also wins the direct comparison with the other local units; it is
therefore also competitive.

The contrary happens when a narrow confidence interval combines with a very
low average efficiency score. Firms with a wide confidence interval present an
interesting situation: the measures obtained using the five chosen procedures do
not agree. This can mean either that the firms are technically more efficient than
competitive or, and this is the most frequent case, the exact contrary.

Table 4: Results of the five estimation procedures

FIRMS   TED_wit  TED_pda  TED_sto  TED_npa  TED_spa  TED_mean  TED_ll  TED_ul
9       0,14     0,18     0,16     0,21     0,43     0,23      0,16    0,33
10      0,15     0,20     0,16     0,22     0,49     0,24      0,16    0,37
13      0,25     0,30     0,13     1,00     0,79     0,49      0,22    0,82
14      0,20     0,31     0,16     1,00     0,75     0,48      0,20    0,79
17      0,17     0,23     0,34     0,46     0,52     0,34      0,24    0,46
39      0,09     0,11     0,11     0,24     0,25     0,16      0,10    0,22
40      0,28     0,45     0,13     1,00     1,00     0,57      0,28    0,89
41      0,32     0,37     0,11     1,00     0,90     0,54      0,21    0,84
45      0,31     0,33     0,12     1,00     0,81     0,51      0,20    0,82
47      0,13     0,25     0,20     1,00     0,75     0,47      0,18
54      0,15     0,20     0,22     0,25     0,46     0,25      0,19    0,36
68      0,79     0,94     0,76     1,00     1,00     0,90      0,81    0,99
82      0,10     0,27     0,25     1,00     0,84     0,49      0,20    0,79
89      0,14     0,19     0,20     0,41     0,46     0,28      0,18    0,39
92      1,00     1,00     0,82     1,00     1,00     0,96      0,89

TED_mean = TED mean.
TED_ll = lower limit (95%).
TED_ul = upper limit (95%).



Figure 1: TED mean and confidence interval (legend: TED_UL, TED_LL, TED_M; horizontal axis: FIRMS)



5. Final Remarks

One of the most frequent problems connected with the analysis of firms'
productive processes and with the definition of an operative standard is the choice
of a general method to be used in any situation.

Nevertheless we know that the choice of a particular approach often has
appreciable effects on the efficiency scores of the firms under study.

In this paper we have proposed an “ecumenical” approach to the analysis of
productive processes which uses five different estimation methods for the production
frontier function. In this way we also obtain, using a bootstrap technique, an
interval estimate of the technical efficiency score of any firm.

As shown above in the empirical analysis, such a methodological choice is
particularly useful when the firms under observation are characterised by high
heterogeneity in their productive processes.

Abstract: This paper deals with the problem of identifying multiple outliers
in multivariate data. Detection of anomalous values is achieved by looking at
the variations in the convex hull of the data set as blocks of observations are
deleted.

Key words: Outliers, Convex Hull.



1. Introduction 

Outliers are defined as the observations which are surprisingly extreme with
respect to the remainder of the data set, and in univariate data sets they can
be easily spotted by numerical or graphical inspection. By contrast,
detection of outliers in multivariate data clouds is a difficult task. The observed
data, in fact, are points in a p-dimensional space and, for p>3, cannot be
inspected directly. Nor is the analysis of the usual 2D or 3D projections useful,
because outliers do not necessarily stick out in some of the coordinate
directions. Many exploratory and graphical methods have therefore been put
forward; the most commonly used ones declare as outlying, as in the univariate
case, any extreme observation which is too far from the center of the data cloud. To
judge extremeness, classical methods use the Mahalanobis distance based on
the sample mean and the sample covariance matrix (Gnanadesikan and
Kettenring, 1972), but they fail to detect the anomalous observations correctly
when there are groups of outliers, due to the masking effect. The most popular
robust alternative procedures (Rousseeuw and van Zomeren, 1990; Hadi, 1992;
Atkinson, 1994) have two main drawbacks: their computational complexity is
high and they, like the Mahalanobis distance, evaluate extremeness with respect



* Work supported by ex-40% MURST Research Project “Nuovi Metodi di Classificazione e
Analisi dei Dati”. M. R. D’Esposito wrote sections 1, 2 and 6; G. Ragozini wrote sections 3, 4 and 5.
Computations are due to G. Ragozini and were carried out with S-Plus code.

to an ellipsoidal hull, hence they work well in a neighbourhood of the normal
distribution and give preference to linear association among the data.

In this paper we propose, for the detection of multiple outliers, an entirely
data oriented procedure, which does not depend on initial estimates of center
and variability, and is instead based on the analysis of the convex hull (CH) of
the data sample. In fact, we suggest judging outlyingness by looking at the
modifications of the CH volume (CHV) when groups of observations are
omitted in turn. We declare outlying those observations whose omission most
decreases the convex hull volume.

The proposed procedure is presented in sections 2 and 3, and is illustrated by
simulated and real data sets in section 4. Related computational problems are in
section 5. Some conclusions and directions of future work are in section 6.



2. Proposed procedure 

Outliers in a given data set can be the result of recording or transmission errors,
misplaced decimal points, exceptional phenomena, observations from a different
population slipping into the sample, etc. In any case, they are observations which
differ very much from the others. From a geometric point of view, outlying data
can be seen as points that lie on the periphery of the point cloud, very far
from the others.

To detect them, both exploratory analyses and indexes of outlyingness have
been proposed. In the class of exploratory methods graphical representations
can be very useful. Unfortunately, to visualize multivariate data we need very
complex plots, such as the parallel coordinate plot, coneplot, dynamic plot, grand
tour and so on, all of which need an expert user; furthermore, identification of
outliers cannot be performed through an algorithm.

This is why many procedures have been proposed based on synthetic indexes 
of outlyingness. However: 

a) the indexes often impose a prechosen form to the data cloud (usually 
ellipsoidal) to declare as outliers the observations which have the highest 
distance from the center of the cloud. 

b) most of the indexes are designed to detect a single outlier. Their
application to the case of multiple outliers is often affected by the danger of
masking (some outliers go unnoticed) or swamping (spurious outliers are
detected) (Barnett and Lewis, 1994; p. 109).

To overcome all these problems, we propose to look at i) the convex hull of
the data, to identify the periphery and the form of the cloud, and at ii) the convex
hull volume, to measure the dispersion of the observations. Outliers will coincide
with the vertices of the outermost convex hulls in the sequence of nested
convex hulls, and their presence will inflate the convex hull volume.

To be more specific, let S be a finite set of n observations $x_i$ in $\mathbb{R}^p$ (i = 1, ..., n).
The convex hull of S is defined as the set of all convex combinations of the points of S:

$$CH(S) = \left\{ x : x = a_1x_1 + \ldots + a_nx_n,\ 0 \le a_i \le 1,\ \sum_{i=1}^{n} a_i = 1 \right\}. \qquad (1)$$



By its definition, the convex hull is the best fit for the periphery of any data
cloud. A point $x_i$ is an extreme point or vertex of S if $CH(S) \ne CH(S - \{x_i\})$.

Given the vertices, the set of edges and the set of facets are identified.

A sequence of nested convex hulls can be constructed by considering the CH
of the entire sample and in turn the CH of the remaining sample after the
deletion of the vertices. For any convex hull, the volume CHV can be obtained
by partitioning the hull into m simplexes with p+1 vertices (for example, in
$\mathbb{R}^2$ a convex hull can be partitioned into triangles). The CHV is then obtained
as the sum of the volumes of the simplexes (Grünbaum, 1967) and does not depend on
the particular partition into individual simplexes.

The omission of one or more observations cannot increase the CHV. Clearly, when
outliers are omitted, the CHV drops sharply.

In our proposal, the variations in the CHV when a subset ${}_kI$ of size k is
omitted are then measured by the index:

$$C({}_kI) = \frac{CHV(S - {}_kI)}{CHV(S)}, \qquad (2)$$

where $CHV(S - {}_kI)$ is the convex hull volume of the subset $S - {}_kI$ (the
original data set without the k observations in ${}_kI$). Observations are deleted in
blocks to avoid the masking effect. Hence, the index is constructed starting from a
quantity (the CHV) which is unambiguously defined and computed, and no
estimation of center is needed. For the univariate case (p=1), the CH of a given
set of points is the line segment with $x_{(1)}$ and $x_{(n)}$ as end points, and the index
in (2) reduces to a Dixon type statistic (Barnett and Lewis, 1994; p. 90), namely
the ratio between the range of the retained observations and the range of the whole sample.
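For bivariate data (p = 2) the index in (2) can be sketched with a standard monotone-chain convex hull and a fan triangulation, so that the "volume" is an area obtained as a sum of triangle areas. This is an illustration only, not the authors' S-Plus code; the six points are hypothetical.

```python
def cross(o, a, b):
    """Twice the signed area of the triangle (o, a, b)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Vertices of the convex hull in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def hull_area(points):
    """CHV for p = 2: sum of the areas of the triangles of a fan partition."""
    h = convex_hull(points)
    return sum(abs(cross(h[0], h[i], h[i + 1])) / 2 for i in range(1, len(h) - 1))

def chv_index(points, deleted):
    """C = CHV(S minus the deleted observations) / CHV(S); 'deleted' holds indices."""
    kept = [p for i, p in enumerate(points) if i not in deleted]
    return hull_area(kept) / hull_area(points)

# Hypothetical data: unit square, one interior point, one far-away point.
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5), (3.0, 3.0)]
print(chv_index(pts, {5}))   # deleting the far-away point shrinks the hull
print(chv_index(pts, {4}))   # deleting an interior point leaves it unchanged
```

Deleting the isolated point reduces the hull area from 3 to 1, while deleting the interior point leaves the index equal to 1, which is the behaviour the index is designed to capture.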




3. Algorithm 

The building blocks of the overall detection procedure are so designed to 
avoid any swamping effect and to lessen the computational complexity. 

The procedure for identifying multiple outlier is as follow: 

Step 1. Given the set S of n points x,-, CH(S) is constructed and CVH{S) 
is evaluated. 

Step 2. To lessen the computational burden the index in (2) is computed 
only for some specific ^7, . Precisely, let Fj be the set of vertices of CH(S) and 
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consider the set S without the points in Fj . Let this set be jiS - Fj } . Given 
, let V 2 be its set of vertices (for example, in Fig. 1, V^ is the set 
of points {20,27,17,14,7,15,25,6}and V 2 is the set {19,10,24,12,16}). This 
process is repeated until a percentage [an] of the original data points are taken 
as vertices, with a chosen so that [an] is the maximum number of outliers 
which the sample is assumed to contain. An upper limit for [an] is [n/2], 
typically 0.1 < a < 0.3 . Let =V^\JV 2 \J...; from now on only the points in 
will be considered for the analysis. 

Step 3. For each possible subset of size A: in 1^ , for = l,...,[an], 
the vertices of CH{S-J^ are computed. 

Step 4. For each ^ 7, in step 3, CHV(^kIi) and the ratio 
CHV[kIi)ICHV(S) are computed, for k = l,...,[on]. 



Step 5. Decision rule. For each $k$ there are $\binom{N}{k}$ subsets ${}_kI_i$, with $N$ the number of elements in the candidate set $V$. The indexes $C({}_kI_i)$ show lower values when there are outlying points in ${}_kI_i$. Hence, we can look at the minimum of $C({}_kI_i)$ over $i$. When one more observation is deleted and it is an outlier, $\min_i C({}_kI_i)$ is much lower than $\min_i C({}_{k-1}I_i)$, and $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$ is large. Higher values of $D_k$ point to outliers among the ${}_kI_i$. If there are at most $h$ outliers, $C({}_hI^*)$ is the minimum value among all $C({}_hI_i)$, with ${}_hI^*$ the set of the $h$ outliers, and $\min_i C({}_kI_i)$ for $k = h+1, h+2, \dots$ stabilizes at $C({}_hI^*)$ and the difference $D_h$ reaches a maximum. When subgroups of outliers are deleted, the index could show local maxima. Note that $C({}_0I)$ is equal to 1 by convention. A typical pattern for $D_k$ is in Fig. 2.

Stopping rule. Since $\min_i C({}_kI_i)$ stabilizes at $C({}_hI^*)$, from the analysis of the index plot $\{(k, D_k)\}$ for $k = 1,2,\dots$ we can decide to stop the procedure when the values decrease by less than a fixed small constant $\varepsilon$.
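
The five steps can be sketched in code for the bivariate case. The following is a minimal Python sketch, not the implementation used in the paper: it replaces Qhull with a plain monotone-chain hull and the shoelace area formula (valid for p = 2 only), it stops peeling as soon as at least `max_out` vertices are collected, and the function names (`hull_indices`, `hull_area`, `candidate_set`, `d_sequence`) and the toy data are assumptions introduced here for illustration.

```python
from itertools import combinations


def hull_indices(points, idx):
    """Indices (into points) of the convex hull vertices of points[idx] (2-D)."""
    order = sorted(idx, key=lambda i: points[i])
    if len(order) <= 2:
        return order

    def cross(o, a, b):
        (ox, oy), (ax, ay), (bx, by) = points[o], points[a], points[b]
        return (ax - ox) * (by - oy) - (ay - oy) * (bx - ox)

    lower, upper = [], []
    for i in order:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], i) <= 0:
            lower.pop()
        lower.append(i)
    for i in reversed(order):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], i) <= 0:
            upper.pop()
        upper.append(i)
    return lower[:-1] + upper[:-1]


def hull_area(points, idx):
    """Convex hull 'volume' (area for p = 2) via the shoelace formula."""
    h = hull_indices(points, idx)
    if len(h) < 3:
        return 0.0
    twice = sum(points[a][0] * points[b][1] - points[b][0] * points[a][1]
                for a, b in zip(h, h[1:] + h[:1]))
    return abs(twice) / 2.0


def candidate_set(points, max_out):
    """Step 2: peel nested hulls until at least max_out vertices are collected."""
    remaining, cand = list(range(len(points))), []
    while len(cand) < max_out and len(remaining) >= 3:
        layer = set(hull_indices(points, remaining))
        cand.extend(sorted(layer))
        remaining = [i for i in remaining if i not in layer]
    return cand


def d_sequence(points, max_out):
    """Steps 3-5: min C(kI_i) for k = 0..max_out, the minimising subsets, and D_k."""
    everything = range(len(points))
    total = hull_area(points, everything)          # CHV(S), Step 1
    cand = candidate_set(points, max_out)
    min_c, best = [1.0], [()]                      # C(0I) = 1
    for k in range(1, max_out + 1):
        c, subset = min(
            (hull_area(points, [i for i in everything if i not in sub]) / total, sub)
            for sub in combinations(cand, k))
        min_c.append(c)
        best.append(subset)
    d = [min_c[k - 1] - min_c[k] for k in range(1, max_out + 1)]
    return min_c, best, d


# Toy data (hypothetical): a 3 x 3 grid plus one far point with index 9.
pts = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
min_c, best, d = d_sequence(pts, max_out=2)
```

On this toy set the far point alone minimises the index for k = 1, so $D_1$ is large and $D_2$ is small, which is the pattern the decision rule looks for.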



4. Some illustrative examples

In order to verify the behaviour of the proposed diagnostic procedure and to show how the stopping rule on the $D_k$'s works, we applied both to three different data sets.

The first data set refers to 28 observations on the body and brain weights of different animal species (in Rousseeuw and Leroy, 1987) (Fig. 1). The first two nested convex hulls contain 13 observations and the three observations {6,16,25} appear as outliers.

Fig. 2 and Table 1 show, respectively, the index plot and the values of the differences $D_k$. $D_k$ reaches its maximum for $k=3$ and falls afterwards. The three outliers are, hence, correctly identified.



Fig. 1: Brain and Body weights data set. Two nested convex hulls are plotted.

Fig. 2: Brain and Body Weights. Index plot of $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$




Table 1: Brain and Body. $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$ values for $k = 1,\dots,12$

k      1     2     3     4     5     6     7     8     9     10    11
D_k    0.10  0.10  0.41  0.04  0.06  0.02  0.04  0.05  0.03  0.02  0.02



The second example is given by an artificial data set. 33 observations were generated by a mixture of two normal distributions:

$$(1-\alpha)\,\phi(x \mid \mu_1, \Sigma) + \alpha\,\phi(x \mid \mu_2, \Sigma), \qquad \alpha = 0.1, \quad \mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 1 \\ 1.5 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} 1 & 0.9 \\ 0.9 & 1 \end{bmatrix}$$

(i.e. there are at most four outliers). The data set, along with the two outer nested convex hulls, is portrayed in Fig. 3. The total number of vertices is 18 and by graphical inspection four outliers are easily spotted (observations 30, 31, 32 and 33).



Fig. 3: Artificial Data Set. Scatter plot and the two outer nested convex hulls.

Fig. 4: Artificial Data Set. Index plot of $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$

Table 2: Artificial Data Set. Some sorted values of $C({}_kI_i)$, for $k = 1,\dots,4$

1I_i   C(1I_i)   2I_i    C(2I_i)   3I_i       C(3I_i)   4I_i          C(4I_i)
32     1         32,33   1         9,32,33    1         7,22,30,31    0.82
22     0.98      2,27    0.98      16,17,10   0.98      7,22,31,32    0.78
31     0.97      30,31   0.94      7,22,31    0.85      4,7,22,29     0.76
7      0.95      7,22    0.87      4,7,22     0.82      30,31,32,33   0.46



Table 2 reports some $C({}_kI_i)$ values when up to four observations are omitted in turn. When one, two or three observations are omitted no point appears as an outlier. Only when the four observations {30,31,32,33} are omitted as a block does the index fall to 0.46. When blocks of five observations are omitted, low values of the index appear for the 5-tuples containing the four outliers. To decide on the number of outliers (4 or more) the decision rule in Step 5 was applied.

The differences $D_k$ are shown in Table 3 and plotted in Fig. 4.



Table 3: Artificial data set. $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$

k      1        2        3        4        5        6        7
D_k    0.0495   0.0828   0.0464   0.3570   0.0852   0.0278   0.0152



The procedure clearly detects the four outliers: $D_k$ reaches its maximum for $k=4$ and stays at much lower values afterwards. Therefore the iterations can be stopped at $k=7$.
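
As a check on Table 3, the differences $D_k$ for $k \le 4$ can be recomputed from the smallest index values listed in Table 2. The short Python sketch below uses the two-decimal minima read from Table 2, so the results agree with Table 3 only up to rounding; the variable names are illustrative.

```python
# Smallest C(kI_i) for k = 0..4; the k = 0 value is 1 by convention, the
# others are the two-decimal minima listed in Table 2.
min_c = [1.00, 0.95, 0.87, 0.82, 0.46]

# D_k = min C(k-1 I_i) - min C(k I_i)
d = [round(min_c[k - 1] - min_c[k], 2) for k in range(1, len(min_c))]
k_max = 1 + d.index(max(d))   # value of k at which D_k peaks

print(d)      # [0.05, 0.08, 0.05, 0.36]
print(k_max)  # 4
```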

Finally, the $p=3$ case is exemplified using the explanatory variables of the stack loss data set, analyzed by Hadi (1992). The data describe the operation of a plant for the oxidation of ammonia to nitric acid. The three predictors are the air flow, the cooling water temperature and the acid concentration, and are shown in Fig. 5. Four outliers appear in the upper part of the 3D plot and correspond to observations {1,2,3,21}.





Fig. 6: Stack Loss data: Convex Hull

Fig. 7: Stack Loss Data. Index plot of $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$

Table 4: Stack Loss Data set: $D_k = \min_i C({}_{k-1}I_i) - \min_i C({}_kI_i)$




The convex hull and the index plot for $D_k$ are shown in Fig. 6 and Fig. 7, respectively. The four outliers are correctly detected. $D_k$ has two maxima because the outliers come in two distinct groups, {1,2,3} and {21}. This example shows the reliability of the procedure when outliers appear on opposite sides of the data cloud.

5. Computational issues 

For the computation of convex hull vertices several algorithms have been proposed. The first computationally efficient algorithm is due to Chand and Kapur (1970); it is based on the so-called gift-wrapping principle and requires $O(n^{\lfloor p/2 \rfloor + 1})$ operations. A more efficient algorithm, which requires $O(n \log n)$ operations but works only up to three dimensions, was proposed by Preparata and Hong (1977). Later on, Edelsbrunner (1987) provided a version in the primal space of Seidel's algorithm (1981) in the dual space. Edelsbrunner's algorithm is based on the beneath-beyond method and requires $O(n \log n)$ operations in two dimensions and $O(n \log n + n^{\lfloor (p+1)/2 \rfloor})$ in more dimensions. The algorithm by Clarkson and Shor (1989), based on the random sampling approach, requires instead an expected number of $O(n \log C)$ operations, where $C$ is the number of vertices. It works, however, only up to three dimensions.

Recently, Barber, Dobkin and Huhdanpaa (1996) have proposed the Qhull algorithm used in this paper. It combines the beneath-beyond method, Eddy's quickhull algorithm in two dimensions (1977) and Clarkson and Shor's derandomized algorithm. Its computational cost is output dependent: it requires $O(n \log C)$ operations for $p \le 3$ and $O(n f_C / C)$ operations for $p \ge 4$, where $f_C$ is the maximum number of facets for $C$ vertices. In practice, it proves to be a very fast algorithm. It also evaluates the convex hull volume at no additional cost.

Evaluating the index $C({}_kI_i)$ is, therefore, computationally feasible. In a sample of size $n$ the index should be computed $\binom{n}{k}$ times, for $k = 1,\dots,[\alpha n]$ (i.e. $\sum_{k} \binom{n}{k}$ times), which can be quite a large number. The stopping rule introduced in Step 5 of the procedure can help in lessening the number of iterations required.
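
The size of this count, and the saving obtained by restricting the subsets to hull vertices as in Step 2, can be illustrated with a few lines of Python. The figures 28 and 13 are the sample size and the number of points on the first two nested hulls in the first example; the cut-off of three deletions and the function name `n_evaluations` are assumptions made here for illustration.

```python
from math import comb


def n_evaluations(n_points, k_max):
    """Number of subsets of size 1..k_max that can be drawn from n_points."""
    return sum(comb(n_points, k) for k in range(1, k_max + 1))


print(n_evaluations(28, 3))  # 3682 subsets when all 28 observations are candidates
print(n_evaluations(13, 3))  # 377 subsets when only the 13 hull points are candidates
```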



6. Conclusions 

The proposed procedure appears very promising. It is simple and suitable for automation. A modification based on clustering around the vertices of the first convex hull is under study to further lessen the computational cost. The effectiveness and the power in dealing with masking and swamping problems have to be further investigated, and the method's power should be compared with that of other available procedures.

Abstract: It is well known that the kernel estimation of multidimensional densities is a difficult task due to the so-called "curse of dimensionality". The greater the data dimension, the greater the sample size required to obtain efficient estimates. To reduce such dimensionality effects, we introduce further smoothing sources in addition to the usual bandwidth parametrization. In particular, preliminary kernel estimates are interpreted as smoothed samples and form the basis for successive density estimates, whose average (weights are given by empirical likelihoods of the observed sample) defines the proposed sequential density estimator.

Keywords: Curse of dimensionality; Likelihood; Smoothed sample. 



1. Introduction 

In nonparametric density estimation, efficient estimates of multidimensional functions require larger and larger samples, with the required size increasing rapidly as the number of dimensions increases (Epanechnikov 1969, Scott and Wand 1991). This situation is known as the "curse of dimensionality" (Huber 1985; Härdle 1990).

To face this difficulty and to balance the number of dimensions with the available sample size, two approaches are possible. The first is the statistical reduction of the number of dimensions, which can be obtained by eliminating redundant information (via principal component analysis, projection pursuit techniques, etc.; see Scott, 1992). From a theoretical point of view, the second is to increase the sample in order to obtain a sufficiently large dataset. Incidentally, we observe that the two approaches are



* This work, though the result of a close collaboration between the authors, has been specifically elaborated as follows: Sections 1, 2, 4 by M. Di Marzio, Sections 3, 5, 6 by G. Lafratta.

** The paper has been supported by a MURST 40% grant titled "Analisi dei dati spaziali", national coordinator Prof. Mauro Coli.

not symmetrical, since the application of the first one does not guarantee the effective balancing discussed above. In fact, the number of retained dimensions can remain too high for successive efficient density estimations. As a consequence, in this paper we investigate the second approach and we develop a method to reduce dimensionality effects on kernel density estimation.

The paper is organized as follows. Section 2 introduces the concept of 
smoothed sample as a tool for increasing the size of available data. In Section 3 
we discuss how to draw samples from estimated multidimensional densities. 
Section 4 contains a full description of the estimation algorithm we propose. 
Section 5 gives some evidence about the efficiency of the method via Monte 
Carlo simulations when sampling from standard bivariate Gaussian 
distributions. Finally, in Section 6 we report some concluding remarks. 



2. Preliminary Sample Smoothing 

Assume a sample $S = \{X_1,\dots,X_n\}$ of size $n$, with $X_v = (X_{v1},\dots,X_{vp})$, $v = 1,\dots,n$, is drawn from an unknown density $f$ of a $p$-dimensional random vector $X = (X_1,\dots,X_p)$. The conventional kernel estimator of $f(x)$ (Rosenblatt 1956, Cacoullos 1966) is

$$\hat f(x; K, H, S) = n^{-1} \sum_{v=1}^{n} K_H(x - X_v), \qquad (1)$$

where $K_H$ is a kernel function, with $K$ a $p$-dimensional density and $H$ a $p \times p$ symmetric positive definite bandwidth matrix. When $p > 1$, we use diagonal bandwidth matrices under an isotropic hypothesis. As a consequence, $H$ must be seen as a multiple of $I_p$ determined by some scalar $h$, with $I_p$ the $p \times p$ identity matrix.
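
Estimator (1) can be written in a few lines once a kernel is fixed. The following Python sketch assumes a standard Gaussian kernel and the isotropic choice $H = h^2 I_p$, so that $K_H(u) = (2\pi h^2)^{-p/2} \exp(-\|u\|^2 / (2h^2))$; both choices, the function name `gaussian_kde` and the three data points are assumptions made here for illustration.

```python
import math


def gaussian_kde(x, sample, h):
    """Kernel estimate (1) at point x with an isotropic Gaussian kernel of bandwidth h."""
    p = len(x)
    norm = (2.0 * math.pi * h * h) ** (-p / 2.0)
    total = 0.0
    for xv in sample:
        sq = sum((a - b) ** 2 for a, b in zip(x, xv))   # squared distance |x - X_v|^2
        total += norm * math.exp(-sq / (2.0 * h * h))
    return total / len(sample)


# Hypothetical bivariate sample of size 3
sample = [(0.0, 0.0), (1.0, 0.5), (-0.5, 1.2)]
density_at_origin = gaussian_kde((0.0, 0.0), sample, h=0.8)
```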

Kernel smoothing procedures determine an estimate $\hat f$ of $f$ on the basis of the sample $S$. In this way, the function $\hat f$ is treated as the final object of the analysis and no attempt is made to use $\hat f$ as a density from which samples of any size can potentially be drawn. Clearly, care has to be taken when choosing $\hat f$. Intuition, for example, suggests the use of small bandwidths in order to remain as close as possible to the observed sample. Furthermore, one should be aware that drawing a sample from $\hat f$ (no matter its size) does not guarantee effective results. The use of a sequence of preliminary kernel estimates from which samples are drawn in turn seems a better strategy. In order to apply such considerations, let us start with the following:

Definition 1. Assume that $K_H$ is a kernel function and $g$ a $p$-dimensional density. Let $y$ be a point in the support of $g$ and $q: \mathbb{R}^p \to [0,+\infty[$ the following function:

$$\forall u \in \mathbb{R}^p: \quad q(u; y) = K_H(y - u)$$

Then we define the Expected Proportion of Points for the pair $(g, K_H)$ at $y$ as the integral:

$$EPP(y; H) = \int_{E(H,y)} g(z)\,dz \qquad (2)$$

where $E(H,y) = \mathrm{supp}(q)$ if the support of $q$ is bounded, while $E(H,y) = B(y,h)$, i.e. the ball in $\mathbb{R}^p$ centered at $y$ having radius $h$, in the other case.

Assume that, for a fixed value of $y$, we have to compute $EPP(y;H)$. Given a sample drawn from $g$, we are able to carry out a kernel estimation of $g$, obtaining say $\hat g$. Thus, we estimate the required integral, say by $\widehat{EPP}(y;H)$, by substituting $\hat g$ for $g$ in (2).

The use we make of this concept is the following. We evaluate the $\widehat{EPP}$ function at the sampled points $X_1,\dots,X_n$ in $S$. The $\widehat{EPP}(X_i;H)$ value can be interpreted as an estimate of the probability that a generic sample point belongs to the set $E(H,y)$, which represents approximately the region whose points, if sampled, will receive a non-zero weight through which they will contribute to the estimate of $f(X_i)$. We interpret the mean $n^{-1}\sum_{i=1}^{n} \widehat{EPP}(X_i;H)$ as the expected frequency of points which will belong to a non-zero area when estimating one of the values $f(X_1),\dots,f(X_n)$. As a consequence, given a sample $S'$ of size $n'$, we can estimate the expected proportion of points of $S'$ which will belong to a non-zero area as the value $(n'/n)\sum_{i=1}^{n} \widehat{EPP}(X_i;H)$. If, in particular, we refer to sample $S$, such expected proportion will be equal to the following index

$$SEPP(H,S) = \sum_{i \in \{1,\dots,n\}} \widehat{EPP}(X_i;H).$$



Finally, we determine the bandwidth $h_S$ such that, for a real constant $0 < c < 1$,

$$h_S = \sup\{h : SEPP(H,S) < c\}. \qquad (3)$$

As a consequence, if we use a bandwidth $h \in\, ]0,h_S]$, our expectation is that, when estimating $f(X_i)$ by means of $\hat f(X_i;K,H,S)$, no more than $c$ points in $S$ will make a non-zero contribution in determining the value of the estimate.
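
To make definitions (2) and (3) concrete, here is a one-dimensional Python sketch under assumptions that are not taken from the paper: the kernel is Gaussian, so its support is unbounded and $E(H,y)$ is the interval $[y-h, y+h]$; the integral of the kernel estimate over that interval is then available in closed form through the normal distribution function; and $h_S$ is located by bisection, which relies on SEPP being increasing in $h$ in this setting. The names `epp_hat`, `sepp` and `h_s` and the sample values are hypothetical. In this setting each point contributes at least $\Phi(1) - \Phi(-1) \approx 0.68$ divided by $n$ to its own term, so SEPP never falls below about 0.68 and the sketch uses $c = 1$.

```python
import math


def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def epp_hat(y, sample, h):
    """Integral over [y - h, y + h] of the Gaussian kernel estimate built on sample."""
    return sum(normal_cdf((y + h - x) / h) - normal_cdf((y - h - x) / h)
               for x in sample) / len(sample)


def sepp(sample, h):
    """Sum of the estimated EPP values at the sample points."""
    return sum(epp_hat(x, sample, h) for x in sample)


def h_s(sample, c, lo=1e-6, hi=1e3, iters=200):
    """Bisection for sup{h : SEPP < c}; requires SEPP(lo) < c <= SEPP(hi)."""
    if not sepp(sample, lo) < c <= sepp(sample, hi):
        raise ValueError("c is outside the range attained by SEPP on [lo, hi]")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sepp(sample, mid) < c:
            lo = mid
        else:
            hi = mid
    return lo


data = [0.0, 1.0, 2.5, 4.0]        # hypothetical univariate sample
h_limit = h_s(data, c=1.0)
```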

This enables us to state the following 

Definition 2 (Smoothed samples). Let $S$ be a sample drawn from $f$:

1. every kernel smoothing $\hat f(\cdot; K, H, S)$ such that $h \le h_S$ is a smoothed sample;

2. for every smoothed sample $\hat f$, if $S'$ is a sample drawn from $\hat f$, and $h \le h_S$, then $\hat f(\cdot; K, H, S')$ is a smoothed sample;

3. the only smoothed samples are those given by 1 and 2.

Smoothed samples are useful tools when increasing the size of available data as 
we will describe in Section 4, when introducing sequences of smoothed samples 
obtained with increasing bandwidths. In Section 3, instead, we discuss how to 
obtain effectively a smoothed sample. 



3. Sampling from Multidimensional Densities 

As the number $p$ of dimensions increases, the frequency of points sampled from $f$ which belong to the distribution tails, i.e. those areas whose probability is relatively small, will also increase. As a consequence, when estimating $f$ on the basis of small samples it is likely that the tails will be overestimated and the relatively more probable areas underestimated. So, when we have to decide from what region $W$ to execute the (re)sampling procedures from given estimates $\hat f$, i.e. the region which the sampled points will belong to, the above discussion must be taken into account. In particular, we make the following choice for $W$:

$$W = \prod_{j=1}^{p} [\xi_{j,\alpha},\ \xi_{j,100-\alpha}] \qquad (4)$$

where $\prod$ is in this case intended as the Cartesian product operator, $0 < \alpha < 10$, and $\xi_{j,r}$ stands for the $r$-th percentile of the distribution of $X_{1j},\dots,X_{nj}$.

To obtain samples from $\hat f$, we apply the rejection method (see, for example, Ross 1996) as follows. We define the density

$$g(y) = \frac{I_W(y)}{\prod_{j=1}^{p} (\xi_{j,100-\alpha} - \xi_{j,\alpha})}$$

and the constant

$$a = \max_{y \in D} \frac{\hat f(y)}{g(y)}$$

where $D$ is a $p$-dimensional grid on $W$. Thus it can be expected that, for all $y \in W$, $\hat f(y)/g(y) \le a$, so that the method applies by generating $Y$ from $g$, $U$ from the uniform density on $[0,1]$, and hence accepting $Y$ as generated from $\hat f$ if $U \le \hat f(Y)/(a\,g(Y))$ and rejecting it otherwise. This enables us to consider $Y$ as distributed following $\hat f$ as required.
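
The rejection step on the percentile box $W$ can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' code: the kernel is an isotropic Gaussian, percentiles are taken by the nearest-rank rule, the bounding constant is the maximum of $\hat f/g$ over a regular grid on $W$ (so it may slightly underestimate the true supremum), and all function names and the data are hypothetical.

```python
import math
import random
from itertools import product


def kde(x, sample, h):
    """Isotropic Gaussian kernel estimate at x."""
    p = len(x)
    norm = (2.0 * math.pi * h * h) ** (-p / 2.0)
    return sum(norm * math.exp(-sum((a - b) ** 2 for a, b in zip(x, xv)) / (2.0 * h * h))
               for xv in sample) / len(sample)


def percentile(values, r):
    """Nearest-rank r-th percentile."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, math.ceil(r / 100.0 * len(s)) - 1))]


def box(sample, alpha):
    """Region W in (4): one percentile interval per coordinate."""
    p = len(sample[0])
    return [(percentile([x[j] for x in sample], alpha),
             percentile([x[j] for x in sample], 100 - alpha)) for j in range(p)]


def rejection_sample(sample, h, alpha, size, grid=15, rng=random):
    """Draw `size` points from the kernel estimate restricted to W."""
    W = box(sample, alpha)
    g = 1.0 / math.prod(hi - lo for lo, hi in W)              # uniform density on W
    axes = [[lo + (hi - lo) * t / (grid - 1) for t in range(grid)] for lo, hi in W]
    a = max(kde(y, sample, h) for y in product(*axes)) / g    # grid bound on f_hat / g
    out = []
    while len(out) < size:
        y = tuple(rng.uniform(lo, hi) for lo, hi in W)        # proposal Y from g
        if rng.random() <= kde(y, sample, h) / (a * g):       # accept with prob f_hat/(a g)
            out.append(y)
    return out


rng = random.Random(1)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(30)]
W = box(data, 5)
draws = rejection_sample(data, h=0.5, alpha=5, size=40, rng=rng)
```

Because the proposals are uniform on $W$, every accepted point lies inside the percentile box, which is how the method keeps resampled points away from the overestimated tails.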



4. Sequential Kernel Density Estimation 



We compute a set of density estimates $\{{}_j\tilde f,\ j = 1,\dots,m\}$, each of which is obtained by applying an iterated kernel smoothing procedure as follows. The algorithm consists of two steps.

The first step, referred to as the smoothing step, is intended to increase the size of the available data. A sequence of, say, $k$ smoothed samples $\{\hat f(\cdot; K, H_i, S_i),\ i = 1,\dots,k\}$, with $h_i \le h_S$, is generated given a vector $(h_1,\dots,h_k)$ of increasing bandwidths and a vector $(n_1,\dots,n_k)$ of integers representing increasing sample sizes. The number of smoothed samples $k$ can be identified, for example, as the number of dimensions $p$, or as the number of modes in a pilot estimate of $f$. Our aim is to increase the sample size in a way which minimizes the loss of information incurred when passing from one smoothed sample to the next. As a consequence, the $h_i$ values need to be very small and to increase very slowly at first, while they need to increase more rapidly only in the last steps. So, let us define the vector $z$ whose $i$-th element, $i = 1,\dots,k$, is given as $z_i = h_S (i/k)$, and observe that the points $z_i$ are equally spaced in an interval $I$ contained in $[0, h_S]$. In order to apply the discussion reported above, we choose to define $h_i$ as the value assumed at $z_i$ by a given convex function $v: I \to [0, h_S]$. When $i = 1$, let us set $S_1 = S$ and $n_1 = n$. As a consequence, the first smoothed sample will be the same for all $j = 1,\dots,m$. For $1 \le i < k$, we compute $\hat f(\cdot; K, H_i, S_i)$, the $i$-th smoothed sample in the sequence, and, successively, a sample $S_{i+1}$ of size $n_{i+1}$ is drawn from it. When $i = k$, we obtain the sample $S_k$, so we are able to compute the last smoothed sample $\hat f(\cdot; K, H_k, S_k)$.

In the second step, referred to as the estimation step, we draw a sample ${}_jS^*$ of size $n^*$ from the last smoothed sample, with $n_k < n^*$, and compute ${}_j\tilde f$ as the kernel estimate $\hat f(\cdot; K, H^*, {}_jS^*)$, obtained on the basis of ${}_jS^*$ given a bandwidth $h^*$.

Now, let ${}_jL(S) = \prod_{i=1}^{n} {}_j\tilde f(X_i)$ be the likelihood that ${}_j\tilde f$ assigns to sample $S$. Then we suggest estimating the analyzed density through the weighted average of the densities ${}_j\tilde f$ as follows:

$$\tilde f(\cdot; S) = \frac{\sum_{j=1}^{m} {}_jL(S)\; {}_j\tilde f(\cdot)}{\sum_{j=1}^{m} {}_jL(S)} \qquad (5)$$
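
The two steps and the weighted average (5) can be sketched in one dimension as follows. This is a sketch under assumptions, not the authors' algorithm as run in the paper: the kernel is Gaussian; sampling from a smoothed sample is done exactly by the smoothed bootstrap (pick a data point at random and add Gaussian noise of standard deviation $h$) instead of the rejection method on $W$ described in Section 3; the bandwidths and sample sizes are arbitrary illustrative values; and the likelihood weights are normalised on the log scale to avoid numerical underflow.

```python
import math
import random


def kde(x, sample, h):
    """One-dimensional Gaussian kernel estimate at x."""
    c = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return c * sum(math.exp(-0.5 * ((x - xv) / h) ** 2) for xv in sample)


def draw(sample, h, size, rng):
    """Draw from the Gaussian kernel estimate built on `sample` (smoothed bootstrap)."""
    return [rng.choice(sample) + rng.gauss(0.0, h) for _ in range(size)]


def sequential_estimate(S, bandwidths, sizes, n_star, h_star, m, rng):
    """Return the likelihood-weighted average (5) of m sequential estimates."""
    finals = []
    for _ in range(m):
        current = list(S)                              # S_1 = S
        next_sizes = list(sizes[1:]) + [n_star]        # n_2, ..., n_k, n*
        for h_i, n_next in zip(bandwidths, next_sizes):
            current = draw(current, h_i, n_next, rng)  # sample from the i-th smoothed sample
        finals.append(current)                         # jS*, of size n*
    # log-likelihood that each estimate assigns to the observed sample S
    log_l = [sum(math.log(kde(x, fs, h_star)) for x in S) for fs in finals]
    top = max(log_l)
    w = [math.exp(v - top) for v in log_l]             # rescaled to avoid underflow
    total = sum(w)
    return lambda x: sum(wi * kde(x, fs, h_star) for wi, fs in zip(w, finals)) / total


rng = random.Random(0)
S = [rng.gauss(0.0, 1.0) for _ in range(50)]
f_tilde = sequential_estimate(S, bandwidths=[0.1, 0.2], sizes=[50, 100],
                              n_star=150, h_star=0.5, m=2, rng=rng)
```

Since every component is itself a density and the weights sum to one, the returned function integrates to one, which gives a simple sanity check on any implementation of (5).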



5. A Monte Carlo Study on Bivariate Gaussian Density 
Estimation 

In this section we test the effectiveness of the proposed methodology through simulation experiments. We carry out a Monte Carlo study in order to compare $\hat f(\cdot; K, H^*, {}_nS_l)$, the conventional kernel estimator (1), with $\tilde f(\cdot; {}_nS_l)$, the sequential kernel estimator (5). For $n = 50, 60, 80, 120$, we draw 200 samples of size $n$, say ${}_nS_1,\dots,{}_nS_{200}$, from a standard bivariate Gaussian distribution $f$. We compute, for all $l = 1,\dots,200$ and for $e = \hat f, \tilde f$, the integrated squared error when estimating $f$ by means of $e$ given ${}_nS_l$:

$${}_nISE_l(e) = \int \left( e(x; {}_nS_l) - f(x) \right)^2 dx$$

where $K$ is the standard bivariate Gaussian kernel function and $H^*$ is the bandwidth matrix which minimizes the MISE when estimating a standard Gaussian distribution. We also employ $H^*$ in the estimation step of the sequential algorithm. Since we introduce further sources of smoothing other than the bandwidth matrix, we decide to hold the bandwidth constant, thus avoiding the problem of assessing the contribution of bandwidth selection procedures to estimation performance.

We set $c = 1$ in formula (3) and we select the convex function

$$v(z) = \exp(15(z - h_S)).$$

In addition, we set $m = 2$, $k = 2$, and we define $n_i = n\,i$, $i = 1,\dots,k$, $n^* = n(k+1)$ and $\alpha = 5$ in formula (4).

In order to obtain a comparison which takes into account the sampling distribution as a whole, we determine, for both $e = \hat f$ and $e = \tilde f$, the following estimates of the Mean Integrated Squared Error for $e$:

$${}_nMISE_l(e) = l^{-1} \sum_{j=1}^{l} {}_nISE_j(e),$$

in which, for $l = 1,\dots,200$, the $l$-th estimate is performed using the first $l$ samples in the considered sequence. Hence we compute, for $l = 1,\dots,200$, the following index:

$${}_nM_l = \frac{{}_nMISE_l(\tilde f)}{{}_nMISE_l(\hat f)} \cdot 100 \qquad (6)$$

Finally, for the purpose of comparing the relative performance of the two estimators sample by sample, we consider, for $l = 1,\dots,200$, the following ratio:

$${}_nD_l = \frac{{}_nISE_l(\tilde f)}{{}_nISE_l(\hat f)} \cdot 100. \qquad (7)$$



6. Discussion 

Table 1 reports some of the simulation results. In particular, we record a value for ${}_{50}M_{200}$ equal to 54.43, which shows that, with a sample size equal to 50, our method reduces the MISE by about 46% compared with the conventional kernel estimator. As the sample size increases, the ratio ${}_nM_{200}$ between the MISE's of the two methods decreases. For $n = 60, 80, 120$, we have, respectively, ${}_nM_{200} = 51.27, 46.32, 37.99$; hence the relative advantage of our method increases.

As reported in Table 2, MISE's decrease differently for the two methods. For example, increasing the sample size by 60% yields a MISE reduction of 30.96% for our method, while for the conventional kernel estimator the reduction is only 18.87%. This means that our method employs the additional sample information on $f$ in a more efficient way. Further sample by sample comparisons are described by means of the ${}_nD_l$ index. In particular, we find that, for $n = 50, 60, 80, 120$, the condition ${}_nD_l > 100$ holds true for, respectively, 19, 18, 19 and 3 of the 200 considered samples. As a result, conventional kernel smoothing seems to perform relatively poorly compared with sequential kernel smoothing when estimating two-dimensional densities. In this paper we report some evidence that additional sources of smoothing can play an important role in facing "the curse of dimensionality". In particular, the use of smoothed samples as a tool for increasing the sample size and the use of weighted averages of different sequential kernel estimates consistently reduce the Mean Integrated Squared Error. Further studies are obviously required in order to generalize these results to estimation problems involving dimensions higher than two.
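
The figures quoted in this discussion follow directly from the mean ISE values in Table 1. The short Python check below recomputes the ratio between the two MISE's for each sample size and the percentage MISE variations relative to $n = 50$; the dictionary names and the helper `decrease` are illustrative.

```python
# Mean ISE values from Table 1 (same scaling for both estimators)
se = {50: 4.2411, 60: 3.7148, 80: 2.9280, 120: 2.1147}   # sequential estimator
ce = {50: 7.7920, 60: 7.2453, 80: 6.3215, 120: 5.5666}   # conventional estimator

# Ratio of the two MISE's, times 100
ratio = {n: round(100 * se[n] / ce[n], 2) for n in se}


def decrease(table, n):
    """Percentage variation of the MISE with respect to the case n = 50."""
    return round(100 * (table[n] - table[50]) / table[50], 2)


ce_dec = {n: decrease(ce, n) for n in (60, 80, 120)}
se_dec = {n: decrease(se, n) for n in (60, 80, 120)}

print(ratio)   # {50: 54.43, 60: 51.27, 80: 46.32, 120: 37.99}
print(ce_dec)  # {60: -7.02, 80: -18.87, 120: -28.56}
print(se_dec)  # {60: -12.41, 80: -30.96, 120: -50.14}
```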



Table 1: Statistics of simulation results. The conventional kernel estimator is indicated by CE, while the sequential kernel estimator by SE.

n     ISE x 10^...   Mean     Median   Min      Max
50    SE             4.2411   3.6577   0.6813   21.0798
      CE             7.7920   7.2820   0.8698   20.6258
      Ratio          0.5443   0.5023   0.7832    1.0220
60    SE             3.7148   3.1254   0.5013   16.6870
      CE             7.2453   6.8353   1.1513   17.4455
      Ratio          0.5127   0.4572   0.4354    0.9565
80    SE             2.9280   2.4583   0.4878   12.3037
      CE             6.3215   6.0230   1.0313   15.0868
      Ratio          0.4632   0.4082   0.4730    0.8155
120   SE             2.1147   1.7720   0.4501    7.0298
      CE             5.5666   5.3470   1.6379   10.7521
      Ratio          0.3799   0.3314   0.2748    0.6538



Table 2: Sample size increases and MISE's decreases for the conventional (CE) and the sequential (SE) estimators. Variations are referred to the case n = 50.

Sample size increases   CE MISE's decreases   SE MISE's decreases
20%                     -7.02%                -12.41%
60%                     -18.87%               -30.96%
140%                    -28.56%               -50.14%

Abstract: Data analysis in Shewhart's Control Chart, using the original intensities of the m samples of size n, is the main subject of this paper. Given m x n intensities we examine three alternatives to synthesize the variability: a) arithmetic mean of the m standard deviations ('S); b) root mean square of the m variances (''S); c) global dispersion ('''S). We prefer the global dispersion to estimate the parent population $\sigma^2$.

As an alternative we suggest analyzing all the items of a single random sample, sized so as to give an efficient estimate. A second proposal is to use the Factory's needs ($P_0$, $P_1$, $\alpha$, $\beta$, $L$ and $U$). Some examples are given in the last section of the paper.

Keywords: Shewhart's Control Chart, Sigma's Estimate, Data Analysis.



1. Introduction 

Using S.C.C. (Shewhart's Control Chart) it is customary to operate in two stages: a first stage devoted to data collection and to the computation of the limits $LCL_{\bar x}$, $UCL_{\bar x}$, $LCL_s$, $UCL_s$ (Lower Control Limit and Upper Control Limit for mean and dispersion). The second stage is devoted to the chart's use.

In the first stage it is customary to produce $K = m \cdot N$ items (in other words we have m lots of size N), and to draw a single random sample of size n from each of the m lots. The population is given by all items produced and to be produced; its mean is $\mu$ and its variance is $\sigma^2$, both supposed stable in the first stage (items produced).

Let us call $x_{ij}$ the $i$th intensity of the $j$th sample, so the $j$th sample mean is:

$$\bar x_j = \sum_i x_{ij} / n \qquad (i = 1,2,\dots,n);$$

$$s_j^2 = \sum_i (x_{ij} - \bar x_j)^2 / (n-1)$$

is the $j$th sample variance estimate;

$$\bar X = \sum_j \bar x_j / m \qquad (1)$$

is the mean of the sample means.

The synthesis of the sample means creates no problem; the same cannot be said for $s_j^2$ or $s_j$. Indeed some authors (W.A. Shewhart, 1931; A.J. Duncan, 1965; P.L. Piccari, 1974; D.C. Montgomery, 1991) propose to compute:

$$'S = \sum_j s_j / m \qquad (2)$$

Some other authors (Mittag-Rinne, 1993) propose to compute:

$$''S = \sqrt{\sum_j s_j^2 / m} \qquad (3)$$

finally one may also compute:

$$'''S = \sqrt{\sum_j \sum_i (x_{ij} - \bar X)^2 / (mn - 1)} \qquad (4)$$

In this paper we study the rationale of each solution and we suggest an alternative proposal.
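
The three ways of synthesizing the variability (arithmetic mean of the m standard deviations, root mean square of the m variances, global dispersion of all m x n intensities) can be compared on any small data set. The Python sketch below is illustrative only: it assumes the divisor n - 1 within each sample and mn - 1 for the global dispersion, and the function name `syntheses` and the two tiny samples are hypothetical.

```python
import math
from statistics import mean, stdev, variance


def syntheses(samples):
    """Return ('S, ''S, '''S) for a list of equally sized samples."""
    s1 = mean(stdev(s) for s in samples)                 # mean of the standard deviations
    s2 = math.sqrt(mean(variance(s) for s in samples))   # root mean square of the variances
    pooled = [x for s in samples for x in s]
    s3 = stdev(pooled)                                   # global dispersion, divisor mn - 1
    return s1, s2, s3


# Two hypothetical samples of size 3 whose means differ
s1, s2, s3 = syntheses([[1, 2, 3], [2, 4, 6]])
print(round(s1, 4), round(s2, 4), round(s3, 4))  # 1.5 1.5811 1.7889
```

In this toy case the sample means differ (2 and 4), so the global dispersion exceeds the within-sample synthesis, which is the situation debated in the next section.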



2. Synthesis analysis 

Since the root mean square is greater than or equal to the arithmetic mean, we may write:

$$'S \le\ ''S \qquad (5)$$

and declare that one of the introduced formulae can't be correct. Relation (2) is the main suspect because:

$$E(s) \ne \sigma$$

The same may be said for (3), and this means that S is a biased estimate of $\sigma$. Someone notes that, if the underlying population is normal, 'S actually estimates $c_4 \sigma$; this is statistically correct but a little cumbersome. We recall that $c_4$ is a constant depending on the sample size n:

$$c_4 = \{2/(n-1)\}^{1/2}\; \Gamma(n/2) / \Gamma\{(n-1)/2\};$$

tabulated values are presented in Duncan (1965).
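
The bias constant is easy to evaluate numerically. The sketch below assumes the usual form of the constant for a standard deviation computed with divisor n - 1, namely $\{2/(n-1)\}^{1/2}\,\Gamma(n/2)/\Gamma\{(n-1)/2\}$; the function name `c4` is chosen here for illustration.

```python
import math


def c4(n):
    """Expected value of s / sigma for a normal sample of size n (divisor n - 1)."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)


for n in (2, 5, 10, 25):
    print(n, round(c4(n), 4))
# 2 0.7979
# 5 0.94
# 10 0.9727
# 25 0.9896
```

For n = 5 the mean of the sample standard deviations therefore underestimates $\sigma$ by about 6%, and the bias shrinks as n grows.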

Let us now consider the synthesis of sample variances (3). Kenney and Keeping (1956) showed that:

$$E(s^2) = \sigma^2,$$

not only for simple samples, but also in the presence of m simple samples. In case h independent samples are available from the universe, they suggest using:

$$\hat\sigma^2 = Q/(U - h);$$

where

$$Q = n_1 s_1^2 + n_2 s_2^2 + \dots + n_h s_h^2,$$

$$U = n_1 + n_2 + \dots + n_h;$$

and $s_i^2$ is the variance in the $i$th sample consisting of $n_i$ variates.

If $n_i = n$ is the same for every sample, we have:

$$\hat\sigma^2 = n\,(s_1^2 + s_2^2 + \dots + s_h^2)/(U - h);$$

where $U = n \cdot h$. Clearly the last relation may be written in the form:

$$\{(n-1)/n\}\,\hat\sigma^2 = (s_1^2 + s_2^2 + \dots + s_h^2)/h$$

The constant $(n-1)/n$ is present because the authors started with $s_j^2 = \sum_i (X_{ij} - \bar X_j)^2 / n$ instead of $S_j^2$, but if the degrees of freedom are used, the result is correct and consistent with $E(''S^2) = \sigma^2$. This solution records time variations. In other words we have a trace of variability changes during the data collection period.

Finally relation (4) is based on the whole group. It may be seen as the total variance, while $''S^2$ may be seen as the within variance. Deviances are the same if the between variance is equal to zero.

There are those who discourage its use. For instance, D.C. Montgomery (1991) affirms that the estimate of the process standard deviation $\sigma$ used in constructing the control limits is calculated from the variability within each sample. Consequently, the estimate of $\sigma$ reflects within-sample variability only. It is not correct to estimate $\sigma$ with the usual quadratic estimator, say '''S, because if the sample means differ, this will cause '''S to be too large. Consequently, $\sigma$ could be overestimated in this way.

A.J. Duncan (1965) shares the same opinion, and maintains that it is not correct to estimate the process standard deviation from all the data (e.g. '''S) and use this in setting up limits for the $\bar X$-chart. The estimate of the process standard deviation to be used in setting up limits for the $\bar X$-chart must be computed from the within-sample variation to the exclusion of the between-sample variation.

Let us remember that if a production process presents stable between-sample variation it could be a good rule to look for the trouble and to remove it if possible. If the problem persists we do not see why it should be ignored by computing the so-called within variation. Another important remark concerns the difference between "first stage" and "second stage". In the second stage production must be monitored, so it is very useful to divide output into lots, say of size N, and investigate every single lot produced. If no trouble appears production can continue; on the contrary, if trouble emerges it is much better to stop production and look for its causes. In the second stage, points are regarded as independent events and the O.C.C. (Operating Characteristic Curve) is computed under this assumption (G. Rouzet, 1957). In short, the division of production into lots of size N is a suitable procedure for the second stage, as we said before. The first stage problem is a different one: to estimate $\mu$ and $\sigma^2$ related to the character of interest. The subject involved is the parent population and its parameters. The division of items into lots of size N is not an essential operation. Perhaps the sample repetition is a mechanical consequence of the second stage technique, to some extent necessary if n = 5, because estimates of $\mu$ and $\sigma^2$ based on so small a sample would be extremely poor; so, to reconcile the two requirements, some authors suggested repeating the sample (and the lot) m times (Mittag-Rinne, 1993). It seemed therefore a natural consequence to compute $\bar x_j$, $s_j$ and 'S, ''S and '''S.



3. Simulation 

In order to support our opinion we consider a simulation. We shall use Wold's Random Normal Deviates divided into lots of size N = 50, one column of numbers per lot. From each column we draw one sample of size n, and this operation is repeated m (= 20) times as in first stage practice. We compute the m values of $\bar x$ and the m values of $s^2$, and the synthesis $''S^2$ is compared with $'''S^2$. We define DifTot = $\sigma^2 - {''S^2}$ and DifUni = $\sigma^2 - {'''S^2}$. We also note that here $\sigma^2$ is the population variance computed on the $N \cdot m = 1000$ data, considering series of 100 samples. If DifTot < DifUni one point is given to $''S^2$, but if DifUni < DifTot then one point is given to $'''S^2$.

For series of samples of size n = 5 we found more than 75% of the points for DifUni, and hence for $'''S^2$.



4. Alternative proposals 

The first stage procedure is a very expensive one. In fact, after m samples we must revise the production process; therefore, to save time and money, we suggest analyzing all the items produced within the first stage and sizing this sample according to the wanted protection.

Our suggestion seems particularly useful for destructive control analysis, because with the customary procedure, if the items not analyzed are out of tolerance, production-control costs increase.

Calling N the first stage lot size, we shall have: $\bar X = \sum_i X_i / N$, and $S^2 = \sum_i (X_i - \bar X)^2 / (N-1)$, as an unbiased estimate.

A different suggestion is based on the introduction of the Factory's needs ($P_0$, $P_1$, $\alpha$, $\beta$, $L$ and $U$). Many authors use the symbol L for the Lower specification limit and the symbol U for the Upper specification limit.

Now let us call $P_0$ the well known Acceptable Quality Level, and we underline that it seems suitable to subdivide $P_0$ into two parts, the one on the left ${}_LP_0$ (fraction of too small items) and the other on the right ${}_UP_0$ (fraction of too large items); of course ${}_LP_0 + {}_UP_0 = P_0$. This is enough for the computation of:

$$X_0 = (L Z_U - U Z_L)/(Z_U - Z_L); \qquad (Z_L < 0)$$

$$\sigma_0 = (U - L)/(Z_U - Z_L);$$

where $Z_L$ is the standardized normal fractile given ${}_LP_0$ and $Z_U$ the one given ${}_UP_0$. $X_0$ and $\sigma_0$ are the parameters to be used for the computation of Shewhart's variables Control Chart.
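
The two parameters follow from the specification limits and the two tail fractions alone. The Python sketch below computes $X_0 = (L Z_U - U Z_L)/(Z_U - Z_L)$ and $\sigma_0 = (U - L)/(Z_U - Z_L)$ using the standard library's normal quantile function; the function name `target_parameters` is an assumption, and the numerical inputs are those of the piston-ring example discussed in Section 5.

```python
from statistics import NormalDist


def target_parameters(L, U, p_low, p_high):
    """Mean and standard deviation leaving fractions p_low below L and p_high above U."""
    z_l = NormalDist().inv_cdf(p_low)          # Z_L < 0
    z_u = NormalDist().inv_cdf(1.0 - p_high)   # Z_U > 0
    x0 = (L * z_u - U * z_l) / (z_u - z_l)
    sigma0 = (U - L) / (z_u - z_l)
    return x0, sigma0


x0, sigma0 = target_parameters(73.981, 74.021, 0.01, 0.01)
print(round(x0, 3), round(sigma0, 4))  # 74.001 0.0086
```

With equal tail fractions $X_0$ falls at the midpoint of the tolerance interval; unequal fractions shift it toward the limit with the larger allowed fraction.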

The see so obtained is a very different tool because it privileges Factory's 
needs, whereas customary procedure privileges process capability. Therefore, 
once obtained the new SCC (UCL- , LCL- , UCL^ ) we must look if production 

process is able to output material just as designer wants (L and U). 

For this test we must collect N data related to the characteristic of interest, 
compute $S^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2/(N-1)$, the variance of the last N items 
produced, and compare it with $\sigma_0^2$. If $S^2 < \sigma_0^2$ the process is capable. 

According to experience with capability studies, it is better to accept the 
process if $S^2/\sigma_0^2 < 1.23$. We did not use the so-called natural tolerance 
concept here, because it is enough to compare the variances directly in order 
to accomplish the test. 

If the production process is not capable, we suggest innovative maintenance, 
taking into account the cost involved in this operation. 

To complete the list of Factory needs we recall the Lot Tolerance Percent 
Defective; α, the producer's risk or first-type error probability; and β, the 
consumer's risk or second-type error probability. All these values must be 
contractually chosen. 
5. An example 

Let us take some data based on the "Inside diameter for automobile engine 
piston rings" example (Montgomery, 1991, p. 234). 

We have: $\bar{X} = 74.001$; $'S = 0.0090$; $''S = 0.0099$; $'''S = 0.0101$. The limits for 
the $\bar{X}$ chart are: 

$UCL_{\bar{x}} = \bar{X} + A_3\,'S = 74.001 + (1.427)(0.009) = 74.014$; 

$LCL_{\bar{x}} = \bar{X} - A_3\,'S = 74.001 - (1.427)(0.009) = 73.988$; 

and for the S chart: 

$UCL_S = B_4\,'S = (2.089)(0.009) = 0.019$. 

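If we introduce L = 73.981; U = 74.021; ${}_LP_0 = 1\%$ and ${}_UP_0 = 1\%$ ($P_0 = 2\%$), we 
can compute the new parameters according to our proposal. We consider: 

$z_L = -2.32635$; $z_U = 2.32635$; $\chi^2 = 16.92386$ (the upper 0.002 point of the 
chi-square distribution with $n - 1 = 4$ degrees of freedom); 

$X_0 = (L z_U - U z_L)/(z_U - z_L) = 74.001$; 

$\sigma_0 = (U - L)/(z_U - z_L) = 0.0086$. 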

Therefore: 

$UCL_{\bar{x}} = 74.001 + 3(0.0086)/\sqrt{5} = 74.01254$; 

$LCL_{\bar{x}} = 74.001 - 3(0.0086)/\sqrt{5} = 73.98946$; 

$UCL_S = \sigma_0 \sqrt{\chi^2/(n-1)} = 0.0086 \sqrt{16.92386/4} = 0.0177$. 

We note that our $UCL_S$ is based on the $\chi^2$-distribution, as suggested by Duncan 
(1965). Our limits are slightly narrower than Montgomery's, but if the factory 
needs are the declared ones (L and U) the production process is not 
capable. The designer's responsibility is very important here, because slightly 
larger tolerances would change the situation. 
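
The arithmetic of this example can be retraced with a short script. This is only a minimal sketch under stated assumptions, not the authors' code: the function and variable names are illustrative, the normal fractiles are taken from Python's `statistics.NormalDist`, and the chi-square value 16.92386 is copied from the text rather than recomputed.

```python
from math import sqrt
from statistics import NormalDist


def designer_limits(L, U, p_low, p_up, n, chi2_value):
    """Control limits driven by the specification limits L and U (sketch)."""
    z_l = NormalDist().inv_cdf(p_low)        # lower fractile z_L, negative
    z_u = NormalDist().inv_cdf(1.0 - p_up)   # upper fractile z_U, positive
    x0 = (L * z_u - U * z_l) / (z_u - z_l)   # target centre X_0
    sigma0 = (U - L) / (z_u - z_l)           # target dispersion sigma_0
    half_width = 3.0 * sigma0 / sqrt(n)      # three-sigma band for the mean
    ucl_s = sigma0 * sqrt(chi2_value / (n - 1))
    return x0, sigma0, x0 + half_width, x0 - half_width, ucl_s


x0, sigma0, ucl_x, lcl_x, ucl_s = designer_limits(
    L=73.981, U=74.021, p_low=0.01, p_up=0.01, n=5, chi2_value=16.92386)

# Capability check from the text: compare the observed dispersion with sigma_0.
s_observed = 0.0099
process_capable = s_observed < sigma0
print(round(x0, 3), round(sigma0, 4), round(ucl_x, 4), round(lcl_x, 4),
      round(ucl_s, 4), process_capable)
```

Because the script keeps more decimals of the dispersion than the rounded 0.0086 used in the text, the limits for the mean agree with the printed ones only to about four decimal places.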

We repeat the observation outlined above. Customary control charts privilege 
process capability. In the presence of a chart (UCL and LCL for $\bar{x}$ and $s$) we 
must verify whether the designer's needs (L and U) are satisfied. With our 
control chart, on the contrary, the designer's needs are privileged but we do not 
know whether the production process is capable. In order to get this last piece 
of information we must compare $''S$ with $\sigma_0$. Of course, if $''S < \sigma_0$ the 
process can satisfy the designer's needs; otherwise we must solve the problem. 
For process capability analysis many references are given by Montgomery (1991). 



6. Use of the chart 

In order to use the chart we draw a sample with n = 5 and compute $\bar{x}$ and $s$. 
With the following data: 74.002; 73.990; 73.997; 74.003; 74.001, we obtain: 
$\bar{x} = 74.002$, $s = 0.002588$. The points are within limits, so production is 
good. Let us now try to use the chart as proposed by Duncan. We have: 
$s^2 = 0.0000067$ and $(UCL_S)^2 = 0.000074 \cdot 4.230965 = 0.000313$; the point is 
within limits as before. If we suppose to have a point very near the limit, e.g. 
$s = 0.0176$, squaring it we get $s^2 = 0.00030976$, and it is also within limits. 

If the point is just out of control, e.g. $s = 0.0178$, we get $s^2 = 0.00031684$ and 
the point is just out of control in the new chart as well. 



7. Conclusions 

The SCC (Shewhart's control chart) is based on the ability of the process to 
produce the wanted items. Indeed, if the control limits ($UCL_{\bar{x}}$, $LCL_{\bar{x}}$, 
$UCL_S$) are computed on either $''S^2$ or $'''S^2$, it is not worth insisting on 
process ability alone: clearly the designer and the designer's needs (L, U) must 
also be considered, so we must carry out a capability study to test whether 
they are in accordance. 

We have seen how it is possible to get new control limits based on $\sigma_0$, 
taking into account the designer's needs (L and U). Items must be output by the 
production process, so we must now check whether it is able to do its work. 

A different subject is the dispersion estimate related to the SCC, taking into 
account the presence of m lots. We have seen that $'S$ is a biased statistic that 
is very cumbersome to adjust. A simulation leads us to prefer $'''S^2$, but this 
means using an $S^2$ chart instead of a $\sigma$ chart. Nevertheless, in our example 
we used $''S$ and $''S^2$ in order to simplify the discussion. 

It seems worth recalling that customary control chart construction 
privileges production process capability, whereas our suggestion privileges 
the designer's needs. In every case we must check the other side of the coin by 
comparing $''S$ with $\sigma_0$ or, better, $'''S$ with $\sigma_0$. Better still is to 
compare $'''S^2$ with $\sigma_0^2$, because $'''S^2$ is an unbiased estimate. 
Abstract: The aim of this paper is to extend projection pursuit regression to the 
case of mixed predictors, according to two different approaches. The former 
consists in converting each categorical regressor into dummy variables. The 
latter consists in preliminarily transforming the predictors by means of principal 
coordinate analysis. In presence of strongly non-linear regression functions and 
interactions between predictors, both procedures improve the results obtained 
by multiple linear regression, distance-based regression, MORALS and ACE. In 
particular, projection pursuit regression in conjunction with principal coordinate 
analysis shows very satisfactory performances. 

Keywords: ACE, Distance-Based Regression Model, Mixed Predictors, 
MORALS, Principal Coordinate Analysis, Projection Pursuit Regression. 



1. Introduction 

Projection pursuit regression (PPR) is a non-parametric regression method 
(Friedman & Stuetzle, 1981; Friedman, 1984a) developed for continuous 
explanatory variables. We propose to extend it to the case of mixed predictors 
following two different approaches, exploited so far in the context of linear 
regression analysis. The former consists in converting each categorical 
regressor into dummy variables. The latter consists in preliminarily 
transforming the predictors by means of principal coordinate analysis (PCA) 
(Gower, 1966). The two approaches are tested on simulated and real data sets 
and compared with classical linear regression (CR) and three procedures 
purposely developed to handle the case of mixed data: the distance-based (DB) 
regression (Cuadras and Arenas, 1990; Cuadras et al., 1996), the multiple 
optimal regression by alternating least squares (MORALS) (Young et al. 1976; 
Young, 1981) and the alternating conditional expectation (ACE) method 
(Breiman and Friedman, 1985). The presented simulation studies stress the 
distinctive feature of PPR methods to model strongly non-linear regression 
surfaces and interactions between predictors. 
The paper is structured as follows: in section 2 we briefly describe PPR, as 
developed by Friedman (1984a); in section 3 we present the two approaches 
proposed for the treatment of mixed predictors in PPR models. Examples and 
conclusions are discussed in the last section. 

2. Projection pursuit regression 

PPR, introduced by Friedman and Stuetzle (1981) and refined by Friedman 
(1984a), is a non-parametric regression method for modelling a q-dimensional 
random vector Y of response variables as a function of a p-dimensional 
random vector X of predictors, on the basis of a sample of n matched 
observations of (Y, X). Each response variable is modelled as a different linear 
combination of smooth functions of different linear combinations of the 
predictor variables. The model takes the form: 

$$E[Y_i \mid x] = \mu_{y_i} + \sum_{m=1}^{M_0} \beta_{im} f_m(\alpha_m' x), \quad i = 1, \ldots, q \qquad (1)$$ 

where $\mu_{y_i} = E(Y_i)$, $\alpha_m$ are p-dimensional unit projection directions, and $f_m$ 
are univariate smooth functions, with zero mean and unit variance, of the 
projections $\alpha_m' x$, $m = 1, 2, \ldots, M_0$. 

Friedman's PPR algorithm (1984a) estimates the coefficients $\beta_{im}$ and $\alpha_{jm}$ by 
least squares, and the smooth functions $f_m$, for each selected projection 
direction, using a variable span smoother, called the supersmoother (Friedman, 
1984b). The algorithm proceeds by finding a model with $M > M_0$ terms and 
then pruning the model back to a total of $M_0$ terms, where $M_0$ and M are 
user-specified parameters. 

In the examples presented in section 4 we consider the case of a single response 
variable (q = 1) and adopt the fraction of unexplained variance (FUV) as a 
measure of goodness-of-fit for PPR models (Friedman, 1984a): 

$$FUV = \sum_{i=1}^{n} \Big[ y_i - \bar{y} - \sum_{m=1}^{M_0} \hat{\beta}_m \hat{f}_m(\hat{\alpha}_m' x_i) \Big]^2 \Big/ \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad (2)$$ 

where $\bar{y}$ is the sample mean of Y, and $\hat{\beta}_m$, $\hat{f}_m$ and $\hat{\alpha}_m$ are the estimates of 
$\beta_m$, $f_m$ and $\alpha_m$ in model (1), respectively. 
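
As an illustration of the goodness-of-fit measure, the fraction of unexplained variance can be computed from the observed responses and the fitted values of any regression method. The helper below is a minimal sketch; its name and arguments are hypothetical, and `fitted` is assumed to already contain the sample mean plus the estimated smooth terms for each observation.

```python
def fraction_unexplained_variance(y, fitted):
    """FUV: residual sum of squares over total sum of squares about the mean."""
    n = len(y)
    y_bar = sum(y) / n
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained part
    tss = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    return rss / tss


y = [1.0, 2.0, 3.0, 4.0]
fuv_perfect = fraction_unexplained_variance(y, y)            # perfect fit
fuv_mean_only = fraction_unexplained_variance(y, [2.5] * 4)  # mean-only fit
print(fuv_perfect, fuv_mean_only)
```

A perfect fit gives FUV = 0, while a model that predicts only the sample mean gives FUV = 1, so smaller values indicate a better fit.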
The introduction of linear combinations of smooth functions and of predictors 
allows PPR to model non-linear regression surfaces and interactions between 
explanatory variables, respectively. 



3. Two approaches to the treatment of mixed regressors in PPR 

The PPR model has been developed for continuous predictors (X). If X is 
composed of continuous, binary and categorical variables, we propose to 
transform the data in two different ways. The former consists in replacing, in 
model (1), each categorical predictor by dummy variables. The latter consists in 
preliminarily transforming the explanatory variables by means of PCA (Gower, 
1966). This method starts by constructing an (n×n) Euclidean matrix of 
dissimilarities $D = (d_{ij})$, $i, j = 1, \ldots, n$, between every pair of individuals in 
the sample. We adopt the similarity coefficient proposed by Gower (1971) to deal 
with mixed data, defined as: 

$$s_{ij} = \frac{\sum_{h=1}^{p_1} (1 - |x_{ih} - x_{jh}|/G_h) + a + c}{p_1 + (p_2 - b) + p_3} \qquad (3)$$ 

where $p_1$ is the number of continuous regressors, a and b are the numbers of 
positive and negative matches, respectively, for the $p_2$ dichotomous regressors, 
and c is the number of matches for the $p_3$ categorical regressors. $G_h$ is the 
range of the h-th continuous regressor. In the absence of missing values, 
$S = (s_{ij})$, $i, j = 1, \ldots, n$, is positive semi-definite, and this implies that the 
dissimilarity matrix $D = (d_{ij})$, with $d_{ij} = \sqrt{1 - s_{ij}}$, is a Euclidean matrix in 
the sense that there is an exact configuration of points in n-1 dimensions or 
fewer with exactly this matrix of Euclidean distances (Gower, 1971). The 
transformed predictors are obtained by the spectral decomposition of the 
symmetric matrix $B = -0.5\,H D^{(2)} H$, where $D^{(2)} = (d_{ij}^2)$ and H is the (n×n) 
centring matrix. The solution is given by $X^* = V \Lambda^{1/2}$, where $\Lambda$ is the diagonal 
matrix containing the non-zero eigenvalues of B arranged in descending order 
and V is the matrix containing the corresponding eigenvectors. A small number 
k of columns of $X^*$ are selected and inserted in model (1). In our examples we 
retain as many columns as the number of the original predictors, choosing those 
with the highest absolute correlation coefficient with the dependent variable, as 
suggested by Cuadras and Arenas (1990) and Cuadras et al. (1996). 
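
To make the first step of this transformation concrete, the sketch below computes Gower's similarity and the corresponding dissimilarity for one pair of individuals. It is an illustrative assumption rather than the authors' implementation: the function name and the way the three groups of variables are passed are hypothetical, binary variables are assumed to be coded 0/1, and the spectral decomposition step is not shown.

```python
from math import sqrt


def gower_similarity(xi, xj, ranges, bi, bj, ci, cj):
    """Gower (1971) similarity between two individuals with mixed data.

    xi, xj : values of the continuous regressors
    ranges : range G_h of each continuous regressor over the sample
    bi, bj : values (0 or 1) of the dichotomous regressors
    ci, cj : labels of the categorical regressors
    """
    cont = sum(1.0 - abs(u - v) / g for u, v, g in zip(xi, xj, ranges))
    a = sum(1 for u, v in zip(bi, bj) if u == 1 and v == 1)  # positive matches
    b = sum(1 for u, v in zip(bi, bj) if u == 0 and v == 0)  # negative matches
    c = sum(1 for u, v in zip(ci, cj) if u == v)             # categorical matches
    denominator = len(xi) + (len(bi) - b) + len(ci)
    return (cont + a + c) / denominator


# One continuous, two dichotomous and one categorical regressor.
s = gower_similarity([1.0], [3.0], [4.0], [1, 0], [1, 0], ["a"], ["b"])
d = sqrt(1.0 - s)   # dissimilarity d_ij = sqrt(1 - s_ij)
print(s, d)
```

The sketch does not guard against a zero denominator, which would occur only if every regressor were dichotomous and every pair of values a negative match.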

In the following section we compare these two solutions with the CR model, the 
DB model, which assumes a linear relationship between the dependent variable 
and the k selected columns of $X^*$, the MORALS method, which consists in 
maximising the multiple correlation coefficient by using an algorithm based on 
the alternating least squares and optimal scaling principles, and, finally, with the 
ACE method, which finds smooth non-linear transformations, both of the 
response and independent variables, that produce the best fitting additive model. 



4. Examples and conclusions 

We test the performances of the two proposed procedures in handling the case 
of mixed predictors (PPR with the method based on converting categorical 
predictors into dummy variables, PPR-1; PPR in conjunction with principal 
coordinate analysis, PPR-2) both on simulated and real data sets, and compare 
them with the results obtained with classical linear regression (CR), 
distance-based regression (DB), MORALS and ACE. The S-Plus functions ppreg() and 
ace() are used for the PPR (PPR-1, PPR-2) and ACE methods, respectively, while 
the SAS proc transreg is used for MORALS. 

The performances of the different methods are evaluated on the basis of the 
fraction of unexplained variance (FUV). 

Simulated data 

Two samples, each of 100 observations, are generated according to the 
following models: 

C1: 

$Y = 0.17X_1 - 0.52X_3 + 0.26X_2 - (0.2X_4 + 0.43X_5 + 0.64X_6) + \varepsilon$ 

C2: 

$Y = \sin(0.17X_1 - 0.26X_2)^2 + \sin(0.52X_3 - (0.2X_4 + 0.43X_5 + 0.64X_6))^2 + \varepsilon$ 

where $X_1$ and $X_2$ are normally distributed with zero mean and unit variance, 
$X_3$ is binary, $X_4$, $X_5$, $X_6$ are dummy variables representing a three-state 
categorical predictor and $\varepsilon \sim N(0, 0.04)$. 



Table 1: Fraction of unexplained variance for cases C1 and C2. 

                      C1                 C2 
Method           M0    FUV          M0    FUV 
CR                     0.7083             0.9357 
DB (k=4)               0.6106             0.7847 
MORALS                 0.4581             0.4411 
ACE                    0.5106             0.7160 
PPR-1            3     0.2769       3     0.3720 
PPR-2 (k=4)      3     0.1701       4     0.1890 
PPR-2 shows the best results in both situations (see Table 1), but PPR-1 can be 
deemed very satisfactory, too. PPR-1 allows the interpretation of the solution in 
terms of the original regressors but it could become unreliable when the number 
of categorical predictors or categories increases. In such a situation the use of 
PPR-2 is advisable. 

The poor performances of the DB model are due to its inability to model 
strongly non-linear regression surfaces. MORALS and ACE fail, as they are not 
able to capture interactions between predictors. CR performs the worst because 
it cannot deal with either non-linear relationships or interactions. 

Real data 

This example is taken from SAS/IML User’s guide (1985, p. 67). The data 
come from an experiment in which nitrogen oxide emissions from a single 
cylinder engine were measured for various combinations of fuel, compression 
ratio and equivalence ratio. Only two kinds of fuel, ethanol and indolene, are 
considered, as in Cuadras and Arenas (1990) and Cuadras et al. (1996) where 
the same data set is used to test the performances of the DB model. The data set 
consists of 110 observations. Two predictors, compression ratio and 
equivalence ratio, are continuous, while the remaining one, fuel, is a two-state 
categorical variable. 

All methods except classical linear regression perform well (see Table 2). In 
particular, our procedures yield results very close to those of MORALS and ACE. 



Table 2: Fraction of unexplained variance for fuel data. 

Method           M0    FUV          M0    FUV 
CR                     0.7709 
DB (k=5)               0.1126 
MORALS                 0.0297 
ACE                    0.0359 
PPR-1            1     0.0407       2     0.0222 
PPR-2 (k=5)      1     0.0643       2     0.0300 



Figure 1 shows that the residuals for the CR model are strongly structured, 
revealing that linear regression doesn’t succeed in catching the bimodal shape 
of the regression function. This shape is clearly detected by the one-term PPR-1 
model (see Figure 2). In Figure 3 the corresponding residuals, with their smooth 
representation, are plotted against the fitted values. 

The smooth function determined by the one-term PPR-2 model shows a non- 
linear shape (see Figure 4). This is the reason why the DB model performs 
worse, resulting in structured residuals (see Figure 6). On the contrary no 
structure emerges from the plot of PPR-2 residuals (see Figure 5). 

The results for both PPR-1 and PPR-2 are only slightly improved by adding one 
more term. 








Figure 4: Smooth function vs linear projection (one-term PPR-2) 

Figure 5: Residuals (1) and smooth residuals (2) vs fitted values (one-term PPR-2) 

Figure 6: Residuals (1) and smooth residuals (2) vs fitted 



On the basis of these analyses and of further real and simulated examples, we 
can conclude that our proposed procedures can be deemed valid competitors to 
the parametric and non-parametric well-established regression methodologies 
developed to deal with mixed explanatory variables. 

Projection pursuit regression based methods for the treatment of mixed 
regressors could be improved by inserting the predictor transformations in the 
model building stage rather than in a preliminary one. We are now working on 
this possibility following the same line of reasoning underlying MORALS and 
ACE procedures. 
Abstract: Dealing with high-frequency time series, such as environmental ones, 
raises important inferential and computational problems. Environmental 
monitoring and forecasting, for instance, require statistical procedures giving reliable 
estimates of unknown parameters and forecasts in real time. In this paper we 
consider dynamic linear models as a basic tool for the analysis of this kind of 
data and propose a recursive estimator for the system parameter. A comparison of 
this estimator with some other estimation methods is provided via Monte Carlo 
simulations. The estimator we propose is computationally efficient and very easy 
to implement. Moreover, in our simulation study, it exhibits good asymptotic 
properties. 

Keywords: Dynamic linear models, system parameter estimation, Kalman filter, 
method of moments estimators. 

1. Introduction 

Usually, time series of environmental data are collected with high frequency. 
Air pollution data, for instance, are usually collected in real time. In this context, 
a statistical model is required to give forecasts in a very short time. To this pur- 
pose, a useful tool is provided by dynamic models for which recursive estimation 
and forecasting procedures are available. For the concentration of each pollutant, 
Italian law fixes two thresholds: a warning threshold and a high risk threshold. 
Whenever the concentration of a pollutant becomes higher than the correspond- 
ing warning threshold, public authorities must implement some environmental 
policy interventions (such as traffic and house heating restrictions), in order to re- 
duce pollution to an acceptable level. If the concentration of a pollutant becomes 
higher than the high risk threshold, citizens’ health might be seriously damaged, 
so public interventions are even more urgently needed. 

In the light of the considerations above, it is clear that environmental mon- 
itoring represents a great challenge for the statistician. In fact, we are required 
to provide forecasts about any crossing over of warning and high risk thresholds, 
and we should do it as early as possible, in order to allow prompt interventions. It 
is also obvious that reliable statistical models are required for this purpose. Since 
we usually have to deal with many response and explanatory variables, it follows 
that such models might be quite complex. Moreover, observations collected with 
a relatively high frequency are usually numerous, and this implies that data 
storing and handling might be quite problematic and time demanding. It is worth 
noting, finally, that environmental systems evolve over both space and time, so 
we have to deal with temporal and spatial dependences among observations and 
with many kinds of temporal and/or spatial changes in the relationships among 
the variables we are interested in. For all these reasons, highly flexible models 
and efficient estimation procedures are needed. In such a context recursive esti- 
mators are mostly useful, in that they permit the achievement of forecasts as soon 
as new observations are available. 

In Section 2, we give a short description of dynamic linear models and pro- 
pose a recursive method for the estimation of the system parameter. Section 
3 recalls some other methodologies well known in the literature which will be 
compared with our estimator in a simulation study, in order to assess their per- 
formance in the operational context we have described above. The results we 
obtained will be shown in Section 4. 

2. A recursive estimator for the system parameter of dynamic 
linear models 

Dynamic linear models give a relationship between a sequence of c-variate, 
$c \ge 1$, observable random vectors $y_t$ and a sequence of k-variate, $k \ge 1$, 
unobservable random vectors $x_t$ ($x_t$ is usually referred to as the state vector), 
$t = 1, 2, \ldots$. This relationship is determined by the following stochastic system: 

$$x_t = M_t x_{t-1} + u_t \qquad (1)$$ 

$$y_t = H_t x_t + v_t \qquad (2)$$ 

where $H_t$ is a known matrix for all t, $v_t \sim WN(0, R_t)$ is the vector of 
measurement errors and $u_t \sim WN(0, Q_t)$ indicates system noise. It is usually assumed 
that the random errors $v_t$ and $u_t$ are mutually uncorrelated, that $u_t$ is uncorrelated 
with $x_j$ for all $j < t$, $t = 1, 2, \ldots$, and that $v_t$ is uncorrelated with $x_j$ for all j and 
t. Let $Y_t = (Y_{t-1}, y_t)$ be the information set available at time t, with $Y_0$ 
representing initial information, and $(y_1, y_2, \ldots, y_t)$ be the series of the observed 
values. If $R_t$, $Q_t$ and $M_t$ are known, the Kalman filter (Kalman and Bucy 1960) 
provides updating equations for both the estimate of the state vector, $x_{t|t}$, and the 
associated error variance matrix $P_{t|t}$. Moreover, if $u_t$ and $v_t$ are normally 
distributed, it is easy to define the predictive distribution of $y_{t+s}$ and $x_{t+s}$, $s > 0$ (a 
statistical development of this issue has been given recently by West and Harrison 
(1989)). 

In most statistical applications, the system parameter $\theta_t = \mathrm{vec}(M_t')$ is only 
partially known. To simplify the presentation, in this section we will assume that 
$\theta_t$ is completely unknown and propose a recursive estimator. Assume that: 

$$\theta_t = A\theta_{t-1} + \eta_t, \qquad (3)$$ 
where A is a known matrix and $\eta_t \sim WN(0, W)$ is a disturbance term, 
uncorrelated with measurement errors and system noise (when A is an identity matrix 
and $\eta_t$ is null, the system parameter is called time invariant). In this case, the 
updating equations provided by the Kalman filter can be used only conditionally 
on $\Theta_t = (\theta_1, \ldots, \theta_t)$, which is a reference trajectory of the values of the 
system parameter. The recursive algorithm we propose can be obtained as follows. 
Substituting $x_t$ as defined by equation (1) into equation (2), we obtain: 

$$y_t = H_t (I_k \otimes x_{t-1}')\,\theta_t + \varepsilon_t \qquad (4)$$ 

with $\varepsilon_t = H_t u_t + v_t$. Equations (3) and (4) define a new dynamic linear model 
where $\theta_t$ can be treated as the state vector and, conditionally on a trajectory 
$X_{t-1} = (x_0, x_1, \ldots, x_{t-1})$, it can be estimated using the Kalman filter. In order to 
obtain updating equations for both $x_{t|t}$ and $\theta_{t|t}$ we have to choose two reference 
trajectories: $\Theta_t$ and $X_{t-1}$. Such trajectories are built step by step: given some 
fixed initial values, $\theta_1$ and $x_0$, at time t, t > 1, the estimate $\theta_{t|t}$ can be obtained 
by applying the Kalman filter to equations (3) and (4), conditionally on $X_{t-1}$; 
then, updating $\Theta_{t-1}$ by putting $\theta_t = \theta_{t|t}$, it is possible to get $x_{t|t}$ by applying the 
Kalman filter to equations (1) and (2) conditionally on $\Theta_t$. So, the t-th value 
of each trajectory is given by: $\theta_t = \theta_{t|t}(X_{t-1})$ and $x_t = x_{t|t}(\Theta_t)$. Assume 
that the initial values $\theta_{1|1}$ and $x_{0|0}$ and the corresponding "prior" dispersion 
matrices $\Omega_{1|1}$ and $P_{0|0}$ have been fixed. At time t = 1 the estimate $x_{1|1}$ can be 
obtained by applying the Kalman filter to eqs. (1) and (2). Hence, at time t = 2, 
assuming $x_1 = x_{1|1}$, the estimate $\theta_{2|2}$ can be obtained by applying the Kalman 
filter to eqs. (3) and (4), conditionally on $x_{1|1}$. This procedure is repeated at each 
time instant. The resulting algorithm is given by two mutually interacting Kalman 
filters and it is similar to the one proposed by Todini (1978). The updating 
equations for $\theta_{t|t}$ and $\Omega_{t|t}$ are given by: 

$$\theta_{t|t} = A\theta_{t-1|t-1} + K_t \left[ y_t - H_t (I_k \otimes x_{t-1}') A\theta_{t-1|t-1} \right] \qquad (5)$$ 

$$\Omega_{t|t} = \left[ I_{k^2} - K_t H_t (I_k \otimes x_{t-1}') \right] \Omega_{t|t-1} \qquad (6)$$ 

where $K_t = \Omega_{t|t-1} (I_k \otimes x_{t-1}')' H_t' \left[ H_t (I_k \otimes x_{t-1}') \Omega_{t|t-1} (I_k \otimes x_{t-1}')' H_t' + H_t Q_t H_t' + R_t \right]^{-1}$, 
with $\Omega_{t|t-1} = A\Omega_{t-1|t-1}A' + W$ (clearly, updating equations for a model with 
time invariant parameter can be easily obtained by putting $A = I_{k^2}$ and W = 0). 
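
The alternation between the two filters may be easier to follow in the simplest possible setting. The sketch below is a hedged illustration for the scalar case only (k = c = 1, H = 1, so that the Kronecker term reduces to the previous state estimate); the function name, the default initial values and the simulated series are assumptions made for the example and are not taken from the paper.

```python
import random


def recursive_estimator(y, q, r, theta0=0.0, omega0=1.0, x0=0.0, p0=1.0,
                        a=1.0, w=0.0):
    """Two interacting scalar Kalman filters (sketch for k = c = 1, H = 1)."""
    theta, omega, x, p = theta0, omega0, x0, p0
    thetas, omegas = [], []
    for yt in y:
        # Filter 1: update the parameter given the previous state estimate.
        f = x                               # H (I_k (x) x'_{t-1}) reduces to x_{t-1}
        theta_pred = a * theta
        omega_pred = a * omega * a + w      # Omega_{t|t-1}
        s = f * omega_pred * f + q + r      # innovation variance
        gain = omega_pred * f / s           # K_t
        theta = theta_pred + gain * (yt - f * theta_pred)   # eq. (5)
        omega = (1.0 - gain * f) * omega_pred               # eq. (6)
        # Filter 2: standard Kalman step for the state, with M_t = theta.
        x_pred = theta * x
        p_pred = theta * p * theta + q
        k = p_pred / (p_pred + r)
        x = x_pred + k * (yt - x_pred)
        p = (1.0 - k) * p_pred
        thetas.append(theta)
        omegas.append(omega)
    return thetas, omegas


# Simulated series from x_t = 0.5 x_{t-1} + u_t, y_t = x_t + v_t.
random.seed(1)
theta_true, q, r = 0.5, 0.5, 0.03
state, y = 0.0, []
for _ in range(300):
    state = theta_true * state + random.gauss(0.0, q ** 0.5)
    y.append(state + random.gauss(0.0, r ** 0.5))

thetas, omegas = recursive_estimator(y, q, r)
print(thetas[-1], omegas[-1])
```

With W = 0 the dispersion of the parameter estimate can only shrink from one step to the next, which is why the correction discussed later for structural changes works by inflating it again.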

The estimator we propose has the following properties: a) it is recursive; b) 
it is semiparametric (we did not make any assumption about data distribution); 
c) it does not require the solution of complex optimisation problems; d) it allows 
a quick intervention whenever structural changes happen (we will provide an 
example later). 

3. Some other estimation methods 

We compared our recursive estimator with three widely used methodologies: 
maximum likelihood estimation (ML), Bayesian posterior analysis and an 
estimation method proposed by Wojcik (1993), based on the method of moments. In 
the following, it will be assumed that $\theta_t = \theta$ for all t. 

Maximum likelihood estimation. ML estimation is a very well known and 
popular methodology because of its asymptotic optimal properties (Gupta and 
Mehra 1974). 

Bayesian method and Gibbs sampling. A Bayesian model might be more 
flexible than ML, but it generally requires heavy numerical computations. Markov 
chain Monte Carlo methods are probably the most powerful tool a statistician can 
use to cope with highly complex models, but they are usually time demanding. We 
will give a brief description of the Gibbs sampler (Gelfand and Smith 1990). In 
our model the parameter of interest is $\psi = [x', \theta']'$, with $x = [x_0', \ldots, x_t']'$. 

The Gibbs sampler generates samples from the joint posterior distribution of 
$\psi$ as follows. Given arbitrary starting values $x_{(0)}$ and $\theta_{(0)}$, we draw $x_{(1)}$ from 
$f(x \mid \theta_{(0)}, y)$, then $\theta_{(1)}$ from $f(\theta \mid x_{(1)}, y)$ to complete the first iteration. After 
s such iterations we obtain $(x_{(s)}, \theta_{(s)})$. Under mild conditions, this random 
vector converges in distribution to the joint posterior as $s \to \infty$. Replicating 
the entire process in parallel G times provides a sample of random vectors 
$(x_{(s)}^{(j)}, \theta_{(s)}^{(j)})$, $j = 1, \ldots, G$, from the joint posterior distribution. These 
observations can be used to estimate any posterior marginal moments (whenever they 
exist) or quantiles, or posterior marginal densities. Now suppose that $R_t = R$ and 
$Q_t = Q$ in the model defined by eqs. (1) and (2) are known for all t and assume 
we observed a sample $y = (y_1, \ldots, y_t)$. Generation of x from $f(x \mid \theta, y)$ can be 
obtained using the method proposed by Carter and Kohn (1994). As far as the 
generation of $\theta$ is concerned, it can be easily shown that, if a Gaussian prior has 
been elicited on $\theta$, i.e. $\theta \sim N(m_\theta, V_\theta)$ a priori, and defining $x_j^* = I_k \otimes x_{j-1}'$, 
$X^* = [x_1^{*\prime}, \ldots, x_t^{*\prime}]'$ and $x = [x_1', \ldots, x_t']'$, $f(\theta \mid x, y)$ will be the density of a 
normal random vector with variance $V_\theta^* = [V_\theta^{-1} + X^{*\prime}(I_t \otimes Q)^{-1} X^*]^{-1}$ and mean 

$$m_\theta^* = V_\theta^* \left[ V_\theta^{-1} m_\theta + X^{*\prime}(I_t \otimes Q)^{-1} x \right].$$ 
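
For a scalar state the conditional posterior of the system parameter reduces to a familiar conjugate update, which can serve as a quick check of the formulas. The function below is an illustrative sketch only: its name and arguments are hypothetical, it assumes k = 1 with known Q = q, and it returns the posterior moments without drawing from the distribution.

```python
def theta_posterior(x, q, prior_mean, prior_var):
    """Posterior mean and variance of theta in x_j = theta * x_{j-1} + u_j.

    x          : state trajectory x_0, x_1, ..., x_t
    q          : known variance of the system noise u_j
    prior_mean : prior mean of theta
    prior_var  : prior variance of theta
    """
    sxx = sum(u * u for u in x[:-1])                 # sum of x_{j-1}^2
    sxy = sum(u * v for u, v in zip(x[:-1], x[1:]))  # sum of x_{j-1} * x_j
    post_var = 1.0 / (1.0 / prior_var + sxx / q)
    post_mean = post_var * (prior_mean / prior_var + sxy / q)
    return post_mean, post_var


# Noise-free trajectory with theta = 0.5 and a very diffuse prior.
post_mean, post_var = theta_posterior([1.0, 0.5, 0.25, 0.125], q=1.0,
                                      prior_mean=0.0, prior_var=1e6)
print(post_mean, post_var)
```

With a diffuse prior the posterior mean is essentially the least squares slope of each state on its predecessor, so the example recovers a value very close to 0.5.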

An estimator based on the method of moments. The estimator we are 
introducing now has been defined by Wojcik (1993). It is considered here for ease of 
implementation and also because it is a consistent estimator. However, it requires 
some severe restrictions. Assuming that $y_t$ (which, in this approach, is assumed 
to be a scalar quantity) is second order stationary, i.e. $R_t = R$, $Q_t = Q$ and 
$M_t = M$ for all t, with all eigenvalues of M strictly less than one in modulus, 
and exploiting the structure of the observability matrix and the Cayley-Hamilton 
theorem, Wojcik considers a relationship between the second order moments of 
$y_t$ and the coefficients of the characteristic equation of M. This relationship is: 

$$\begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_k \end{bmatrix} = \begin{bmatrix} \Gamma(1) & \cdots & \Gamma(k) \\ \vdots & & \vdots \\ \Gamma(k) & \cdots & \Gamma(2k-1) \end{bmatrix}^{-1} \begin{bmatrix} \Gamma(k+1) \\ \vdots \\ \Gamma(2k) \end{bmatrix} \qquad (7)$$ 

where $\Gamma(i) = Cov(y_t, y_{t-i})$ and $\alpha_1, \ldots, \alpha_k$ are the coefficients of the 
characteristic equation of M. Substituting second order moments with their empirical 
estimates, we obtain consistent estimates of $\alpha_1, \ldots, \alpha_k$. The model can be 
suitably reparametrized via a non-singular transformation of the state vector, leading 
to the so-called canonical representation, so that the transition matrix of the new 
model is the companion form of M, say $M^*$: 

$$M^* = \begin{bmatrix} 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & 1 \\ -\alpha_1 & -\alpha_2 & \cdots & -\alpha_k \end{bmatrix}$$ 

which can be estimated consistently. 

4. Results of a simulation study 

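We applied the three methods to simulated data generated from models for 
scalar observations with the following structure for all t: 

$$x_t = M x_{t-1} + B u_t, \quad u_t \sim N(0, Q) \qquad (8)$$ 

$$y_t = H x_t + v_t, \quad v_t \sim N(0, R), \qquad (9)$$ 

$$M = \begin{bmatrix} \phi_1 & \phi_2 & \phi_3 & \phi_4 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \quad H = [\,1\ 0\ 0\ 0\,], \quad B = [\,1\ 0\ 0\ 0\,]'.$$ 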

We considered four models, denoted by M1, M2, M3 and M4. In Table 
1 the values of the parameters for each of them are reported together with the 
signal-to-noise ratio $Var(Hx_t)/Var(y_t)$. For M1, M2 and M3 we generated 
1000 independent time series of length 300. The rationale underlying the choice of 
these models is the following. It is well known that when the signal-to-noise 
ratio is low we cannot achieve good estimates of the unobservable state, 
even when all parameters are known. We wanted to test how the signal-to-noise ratio 
affects the performance of system parameter estimators. Moreover, we wanted 
to assess whether or not a high variability of the state (i.e. a high value of Q in our 
simulation) implies a poor efficiency of the recursive estimator. 

We also wanted to test the performance of the recursive estimator when the 
system parameter changes over time. Therefore we generated 100 independent 
time series of length 2000 from M4. The initial conditions in all the 
applications of the recursive estimator were fixed at: $x_{0|0} = 0$, $P_{0|0} = I_k$, $\theta_{0|0} = 0$, 
$\Omega_{0|0} = \begin{bmatrix} I_k & 0 \\ 0 & 0 \end{bmatrix}$. 

Fig. 1 shows the comparison of the behaviour of the recursive estimator with 
that of ML estimator, which has been computed at 



Table 1: Parameter values and signal-to-noise ratios for models M1, M2, M3 
and M4. 

Model             φ1      φ2      φ3     φ4     Q      R      Var(Hx_t)/Var(y_t) 
M1                0.50   -0.10    0.07   0.01   0.50   0.03   0.955 
M2                0.50   -0.10    0.07   0.01   0.50   0.15   0.811 
M3                0.50   -0.10    0.07   0.01   2.50   0.03   0.991 
M4 (t ≤ 1000)     0.50   -0.10    0.07   0.01   0.50   0.03   0.955 
M4 (t > 1000)    -0.50   -0.10    0.07   0.01   0.50   0.03   0.955 

t = 20, 40, 60, 80, 100, 150, 200, 250, 300 (the curves relative to the 
quantiles of the latter have been obtained by linear interpolation). The 5-th, 50-th 
and 90-th percentiles of the Euclidean norms of $\theta - \hat{\theta}_{ML}$ and $\theta - \theta_{t|t}$ get quite 
close to each other at t = 50 and they tend to overlap for t > 300, indicating 
a good performance of the recursive estimator. Similar considerations can be 
made for the results about models M2 and M3, which are reported in Table 2 for 
t = 20, 50, 300. Bayesian analysis via Gibbs sampling gave satisfactory results 
as well, but it does not seem adequate for the application field we are interested 
in, because of the computational complexity of the Gibbs sampler. In fact, the 
analysis of a single time series of length 300 requires about 20 minutes on a PC 
with a 75 MHz Pentium processor. This is a very long time if compared with 
the speed of the recursive estimator reported below. As can be seen in Table 2, 
moderate changes in the value of the signal-to-noise ratio do not seriously affect 
the behaviour of the recursive and ML estimators. The variability of one-step-ahead 
forecast errors is instead sensitive to the values of the variances of the 
disturbances appearing in eqs. (8) and (9). The behaviour of the estimator based 
on the method of moments was very unsatisfactory. This result is not surprising 
since, whenever the matrix appearing in eq. (7) is close to singularity, numerical 
instability problems may arise. For this reason the estimates assume values quite 
distant from the true ones even for time series of length 2000. As far as 
computing time is concerned, ML estimation requires about 2 minutes for a time series 
of length 300 using the scoring algorithm, whereas the recursive estimator produces 
its output in 0.005 minutes. 

The performance of the recursive estimator when a structural change happens
is particularly interesting. In the estimation of the system parameter for
model M4, we followed two alternative strategies: in one case we used the
recursive estimator as defined in Section 2 (no correction); in the other we
increased the dispersion matrix of $\hat\theta_{t|t}$ at t = 1100 by putting
$\Omega_{1100|1100} = 0.5\,\Omega_{0|0}$, in order to accelerate the learning process of the
recursive estimator. The result is shown in Fig. 2: the good performance of the
estimator in adapting to the new situation is apparent. The behaviour of the
innovations is shown in Fig. 3, and it suggests that a careful investigation of
their behaviour might give useful insights as far as the detection of model
changes is concerned.
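
The effect of inflating the dispersion after a structural change can be illustrated with a minimal sketch. It is not the recursive estimator of Section 2: it assumes a generic scalar recursive (Kalman-type) estimator of a constant level, with hypothetical names `recursive_level`, `p0`, `r`, `reset_at` and `reset_value`, and a level that moves from 0.5 to -0.5 at t = 1000 purely for illustration.

```python
import numpy as np

def recursive_level(y, p0=1.0, r=1.0, reset_at=None, reset_value=None):
    """Recursively estimate a constant level from noisy observations.

    p0 is the initial dispersion, r the observation variance (both assumed).
    If reset_at is given, the dispersion is set to reset_value at that step,
    which raises the gain and speeds up learning after a change.
    """
    theta, p = 0.0, p0
    path = np.empty(len(y))
    for t, obs in enumerate(y):
        if reset_at is not None and t == reset_at:
            p = reset_value                      # the "correction"
        gain = p / (p + r)
        theta = theta + gain * (obs - theta)     # update with the innovation
        p = (1.0 - gain) * p                     # dispersion shrinks over time
        path[t] = theta
    return path

rng = np.random.default_rng(0)
# Level 0.5 for 1000 steps, then -0.5 for 500 steps (illustrative values).
y = np.concatenate([0.5 + rng.normal(0, 1, 1000), -0.5 + rng.normal(0, 1, 500)])

plain = recursive_level(y)                                   # no correction
corrected = recursive_level(y, reset_at=1100, reset_value=0.5)

err_plain = abs(plain[-1] - (-0.5))
err_corrected = abs(corrected[-1] - (-0.5))
print(f"final error without correction: {err_plain:.3f}")
print(f"final error with correction:    {err_corrected:.3f}")
```

Without the reset the gain has decayed to roughly 1/t by the time the change occurs, so the estimate drifts towards the new level only slowly; resetting the dispersion restores a large gain for the following steps.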
Table 2: Comparison of 5-th, 50-th and 95-th quantiles of $\|\theta - \hat\theta\|$ obtained
through recursive (RE) and maximum likelihood (ML) estimators.

                          M1                M2                M3
                      RE      ML        RE      ML        RE      ML
t = 20     q0.50    0.435   0.351     0.463   0.365     0.432   0.325
           q0.95    0.777   0.542     0.822   0.580     0.822   0.580
t = 50     q0.50    0.287   0.219     0.326   0.281     0.270   0.216
           q0.95    0.526   0.426     0.586   0.461     0.489   0.383
t = 300    q0.50    0.114   0.095     0.147   0.123     0.113   0.100
           q0.95    0.212   0.186     0.282   0.266     0.205   0.197

Figure 1: Model M1: (a) 5-th, 50-th and 95-th percentiles of $\|\theta - \hat\theta_{t|t}\|$ (bold
lines) and $\|\theta - \hat\theta_{ML}\|$ (dotted lines); (b) 5-th, 50-th and 95-th percentiles of
innovations when RE is used.




Figure 2: Parameter change: plot of 5-th, 50-th and 95-th percentiles of
$\|\theta - \hat\theta_{t|t}\|$ with and without correction.

Figure 3: Parameter change: plot of 5-th, 50-th and 95-th percentiles of
innovations with and without correction.


5. Conclusions

The results of our simulation study shed some light on the applicability of the
recursive algorithm we propose for the estimation of the system parameter. In the
examples we examined, the signal-to-noise ratio did not seem to affect strongly
the performance of the estimator. Moreover, we can hypothesise that the
asymptotic behaviour of our recursive procedure might be very close to that of
ML estimation. In the context of environmental monitoring our method has some
advantages over ML. In fact, ML requires some complex optimisation procedures
which need to be carefully checked in order to avoid convergence to local
maxima. Moreover, it is difficult to define the likelihood of the model when
some parameter changes happen and the change points are unknown. Our method,
instead, being recursive, is very quick to implement and may adapt rapidly to
parameter changes through a very simple intervention. Moreover, it does not
require any assumption about the data distribution.
Abstract. We propose kernel methods for estimating covariance functions,
when the data consists of a collection of curves. Every curve is modelled
as an independent realization of a stochastic process with unknown mean
and covariance structure. We consider a kernel density estimator, which
has the positive semi-definiteness property on the "time" points and also in
the continuum. We describe a cross-validation procedure, which leaves out
an entire curve at a time, to choose the bandwidth (smoothing parameter)
automatically from the observed collection of curves.

Key words: Covariance, Cross-validation, Functional Data Analysis,
Kernel Density Estimation, Smoothing.

1. Introduction 

In many research areas, experiments typically produce responses which are
curves, rather than single data points. For instance, curves arise naturally
as observations in the investigation of growth, in survival analysis, in signal
processing, and more generally in the interpretation of automated on-line
data, where a separate curve is observed for each individual/unit in a
sample. A more detailed account of the statistical analysis of curves (named
Functional Data Analysis) may be found in Ramsay (1982), Ramsay &
Dalzell (1991) and Rice & Silverman (1991).

Suppose that a collection of n curves is available, where each curve is
observed at "time" points $t_1, \dots, t_p$. Assume that the sample curves are
independent realizations of a stochastic process X(t), with unknown mean
$\mu(t) = E\{X(t)\}$ and unknown covariance function

$$\rho(s,t) = \mathrm{cov}\{X(s), X(t)\},$$

for general (s,t) and without structural assumptions about $\rho$. The $t_j$'s are
not necessarily evenly spaced. The jth point on the ith curve will be
indicated by $X_i(t_j) = X_{ij}$. We wish to estimate the covariance function $\rho$,
nonparametrically, from the n observed curves.

For related work see, for instance, Azzalini (1984), Hart &
Wehrly (1986) and Diggle & Hutchinson (1989), when the curves arise from
time series or repeated measurement data, and Diggle et al. (1987), Berman
& Diggle (1989) and Sampson & Guttorp (1992), when the curves exhibit
some spatial interdependence.

In particular, in section 2 we describe our basic kernel density estimator
of the covariance function $\rho$. In section 3, we describe a cross-validation
procedure for an automatic choice of the corresponding bandwidth. In
section 4, we obtain a version of this kernel density estimator with the positive
semi-definiteness property in the continuum. Finally, in section 5 we show
the effectiveness of the entire procedure on a real example.



2. Kernel estimation of covariance functions



Hall et al. (1994) have proposed a kernel method for estimating the co- 
variance function p of a stochastic process. We try to extend their basic 
estimator to the case of n curves. 

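Over the curves i = 1, ..., n, we denote by

$$\bar X_j = n^{-1} \sum_{i=1}^{n} X_{ij}, \qquad X_{jk} = n^{-1} \sum_{i=1}^{n} (X_{ij} - \bar X_j)(X_{ik} - \bar X_k)$$

the sample mean of $X_i(t_j)$ and the standard covariance function between
$X_i(t_j)$ and $X_i(t_k)$, for every j, k = 1, ..., p.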

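Let K denote a bivariate kernel function, which is taken to be spherically
symmetric (cf. Wand & Jones (1995), section 4.2). Let $H = (h_1, h_2)$ be the
bandwidth or smoothing parameter. A kernel estimator of the covariance
function $\rho$ may be defined as

$$\hat\rho(s,t) = \left\{ \sum_{j \neq k} K\!\left( \frac{s - t_j}{h_1}, \frac{t - t_k}{h_2} \right) X_{jk} \right\} \left\{ \sum_{j \neq k} K\!\left( \frac{s - t_j}{h_1}, \frac{t - t_k}{h_2} \right) \right\}^{-1} . \qquad (1)$$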

Note that $\hat\rho(s,t)$ is defined without assuming that $\rho(s,t)$ admits an
orthogonal expansion in terms of a specific set of eigenfunctions (cf. Rice &
Silverman (1991) and Ramsay & Dalzell (1991)).
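
A minimal numerical sketch of a kernel-weighted covariance estimate of this kind may help fix ideas. It makes several assumptions that are not taken from the paper: a product Gaussian kernel, a single bandwidth `h` for both arguments, divisor n in the sample covariance, and simulated random-walk curves in place of real data. The helper names `sample_cov`, `gauss` and `kernel_cov` are hypothetical.

```python
import numpy as np

def sample_cov(X):
    """X has shape (n, p): row i holds curve i at the p time points.
    Returns the p x p matrix of standard covariances (divisor n, assumed)."""
    centred = X - X.mean(axis=0)
    return centred.T @ centred / X.shape[0]

def gauss(u):
    """Univariate standard Gaussian density, used to build a product kernel."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_cov(s, t, tp, C, h):
    """Kernel-weighted average of the off-diagonal entries C[j, k]
    with weights K((s - t_j)/h, (t - t_k)/h), evaluated at one point (s, t)."""
    W = np.outer(gauss((s - tp) / h), gauss((t - tp) / h))
    np.fill_diagonal(W, 0.0)            # the j == k terms are left out
    return float((W * C).sum() / W.sum())

rng = np.random.default_rng(0)
tp = np.arange(1.0, 10.0)                             # 9 evenly spaced time points
X = np.cumsum(rng.normal(size=(48, 9)), axis=1)       # 48 simulated curves
C = sample_cov(X)

est = kernel_cov(4.5, 6.5, tp, C, h=0.75)             # a point never observed
print(f"estimated covariance at (4.5, 6.5): {est:.3f}")
```

Because the estimate is a ratio of kernel-weighted sums, the normalising constant of the kernel cancels, and the surface can be evaluated at any (s, t), not only at observed pairs of time points.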

Assume that $t_j = \lambda u_j$, where $\lambda$ is a positive number, such that
$\lambda = \lambda(p) \to \infty$ as $p \to \infty$, and $u_1, \dots, u_p$ are observed values of independent
random variables $U_1, \dots, U_p$, all having distributions not depending on p and
the process X. Alternatively, we may take $t_1, \dots, t_p$ to be evenly spaced on
an interval of width $\lambda$. For simplicity, we take a bandwidth H with equal
components, i.e.

$$h_1 = h_2 = h .$$
Under suitable regularity conditions on the distributions of $U_1, \dots, U_p$
and the kernel K, it follows that

$$E\{\hat\rho(s,t) - \rho(s,t)\}^2 = q(s,t)\,(h^4 + \lambda^{-1}) + o(h^4 + \lambda^{-1}) , \qquad (2)$$



where 




Estimator $\hat\rho(s,t)$ converges to the true covariance function $\rho(s,t)$ in $L^2$, for
every (s,t), as $p h \lambda^{-1} \to 0$. Under the condition that the process is observed
over an increasingly wide range of time points, the kernel density estimator
$\hat\rho(s,t)$ consistently estimates the covariance function $\rho(s,t)$, for every (s,t),
provided the bandwidth is properly selected.

A more complicated asymptotic setting is needed if we include the j = k
terms in the sums defining the kernel density estimator (1). However, the
resulting alternative kernel density estimator would have a similar asymptotic
behaviour, without remarkable advantages in terms of mean square error.

3. Bandwidth selection by cross-validation 

Bandwidth h may be chosen by a cross-validation procedure, which works by
leaving out an entire curve at a time, instead of leaving out a single "time"
point. A general introduction to cross-validation methods may be found in
Wand & Jones (1995), chapter 3.

For every j,k = 1,. . . ,p, with a slight abuse of notation, we denote by 

the estimate (1) of the covariance function p{tj,tk), obtained by applying 
(1) to all the curves, but the ith curve. The cross-validation score may be 
then defined as 



= E E • (3) 

2=1 j^k 

A bandwidth selector $\hat h$ for h in (1) is the minimizer of (3).

Note that $X_{jk}$ is a standard estimator of the covariance between time
points $t_j$ and $t_k$, where j, k = 1, ..., p, borrowed from classical multivariate
analysis. Two different time points $t_j$ and $t_k$ are regarded as labels of two
dependent n-variate random variables. The choice of h determines the smoothed
surface $\hat\rho$, which interpolates the points $X_{jk}$, where j, k = 1, ..., p. The
minimization of the cross-validation score (3) produces the surface $\hat\rho$ which
turns out to be, in a certain sense, optimal.
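
The leave-one-curve-out idea can be sketched as follows. The score used here is an assumption made for illustration, not the paper's equation (3): for each left-out curve, the products of its deviations from the mean of the remaining curves are compared, over all pairs j ≠ k, with the smoothed surface computed from the remaining curves. A product Gaussian kernel, a common bandwidth and simulated curves are also assumed, and the names `smooth_on_grid` and `cv_score` are hypothetical.

```python
import numpy as np

def smooth_on_grid(C, tp, h):
    """Kernel-smooth the off-diagonal entries of C and evaluate the
    resulting surface at every pair (t_a, t_b) of observed time points."""
    G = np.exp(-0.5 * ((tp[:, None] - tp[None, :]) / h) ** 2)  # G[a, j]
    off = 1.0 - np.eye(len(tp))                                # mask for j != k
    return (G @ (C * off) @ G.T) / (G @ off @ G.T)

def cv_score(X, tp, h):
    """Assumed leave-one-curve-out score: sum of squared differences between
    the left-out curve's deviation products and the surface fitted without it."""
    n, p = X.shape
    off = ~np.eye(p, dtype=bool)
    score = 0.0
    for i in range(n):
        rest = np.delete(X, i, axis=0)          # all curves but the ith
        mean_rest = rest.mean(axis=0)
        centred = rest - mean_rest
        C = centred.T @ centred / rest.shape[0]
        fit = smooth_on_grid(C, tp, h)
        d = X[i] - mean_rest
        target = np.outer(d, d)
        score += ((target - fit)[off] ** 2).sum()
    return score

rng = np.random.default_rng(1)
tp = np.arange(1.0, 10.0)
X = np.cumsum(rng.normal(size=(30, 9)), axis=1)   # simulated curves

grid = np.array([0.5, 0.75, 1.0, 1.5, 2.0, 3.0])  # candidate bandwidths
scores = np.array([cv_score(X, tp, h) for h in grid])
h_hat = grid[scores.argmin()]
print("scores:", np.round(scores, 1))
print("selected bandwidth:", h_hat)
```

Evaluating the score over a small set of candidates, as done here, also gives the plot of the score against h that can guide a subjective choice when no clear minimizer exists.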







Typically, this cross-validation procedure demands an efficient numerical
minimization algorithm (cf. Press et al. (1992), chapter 10) as a necessary
ingredient. If a unique minimizer does not exist, h may be chosen subjectively
by plotting S given by (3), as a function of h, for a convenient set
of candidates. A subjective choice for h could be, for instance, the value
producing a desired level of smoothing in the surface $\hat\rho(s,t)$. A subjective
choice for h could also be the value able to produce a surface $\hat\rho(s,t)$ which
approximates well the behaviour of the standard estimator $X_{jk}$ over a specified
region of points (s,t) of interest.

We do not discuss here the question of selecting an optimal bandwidth,
able to minimize the mean square error in (2) and maintain the consistency
of $\hat\rho(s,t)$ for every (s,t) as $p h \lambda^{-1} \to 0$.



4. Ensuring the positive semi-definiteness property 

Estimator $\hat\rho(s,t)$ given by (1) is not necessarily itself a covariance function,
since it does not typically satisfy the positive semi-definiteness property

$$\int\!\!\int \rho(s,t)\, w(s)\, w(t)\, ds\, dt \ge 0 , \qquad (4)$$

for all integrable functions w. Hall et al. (1994) suggest a procedure for
making a kernel estimator of the covariance function $\rho$ itself a covariance
function. The rationale behind that procedure can be applied to estimator
(1) as well.
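
A discretised analogue of the positive semi-definiteness property can be checked numerically. The sketch below is only a diagnostic and is not the Fourier-based correction of Hall et al. (1994): it assumes that a candidate surface has been evaluated on a finite grid, and it tests whether the resulting symmetric matrix has a negative eigenvalue, in which case some weight vector w gives a negative quadratic form. The function name `min_eigenvalue` and both example matrices are hypothetical.

```python
import numpy as np

def min_eigenvalue(R):
    """Smallest eigenvalue of the symmetrised matrix of surface values.
    A negative value means the grid version of the property fails."""
    R_sym = 0.5 * (R + R.T)
    return float(np.linalg.eigvalsh(R_sym).min())

grid = np.linspace(1.0, 9.0, 17)

# A genuine covariance function, min(s, t), evaluated on the grid.
valid = np.minimum.outer(grid, grid)

# The same surface with one pair of entries distorted so that
# |rho(s, t)| exceeds sqrt(rho(s, s) * rho(t, t)) for that pair.
broken = valid.copy()
broken[0, -1] = broken[-1, 0] = 20.0

print("valid surface, smallest eigenvalue: ", min_eigenvalue(valid))
print("broken surface, smallest eigenvalue:", min_eigenvalue(broken))
```

A surface that passes this check on one grid may still fail in the continuum, which is why a correction acting on the whole function is of interest.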

By Bochner's theorem (cf. Bhattacharya & Waymire (1990), pages 663-664),
property (4) is equivalent to nonnegativity of the Fourier Transform
of $\rho$, that is

$$\tilde\rho(\theta_1, \theta_2) \ge 0 , \qquad (5)$$

for all $\theta_1$ and $\theta_2$, where




To obtain (5) from (1), we first define the function 

' p(s,t) 0 < s < 5i, 0 < t < Ti 

p(^) ^ 1 ) T2-Tx 0 < S < S'!, Ti < t < T 2 

?i(s,«)=< s, < s < 82,0 <t <Ti 

S,<s<S 2 ,T,<t<T 2 

^ 0 elsewhere , 



( 6 ) 
where $S_1$, $S_2$, $T_1$ and $T_2$ work as truncation points. Surface $\hat\rho(s,t)$, outside
of the region $\{0 < s < S_1,\ 0 < t < T_1\}$, is substituted with planes (whose
equations are given in (6)) or with the value 0.

We assume the conditions for points $t_1, \dots, t_p$ stated above in section
2. Under further regularity conditions on the distributions of $U_1, \dots, U_p$ and
the kernel K, it can be shown that $\hat\rho_1(s,t)$ consistently estimates the true
covariance function $\rho(s,t)$, for every (s,t).

We put 



noo 

Pi{s,t) cos{6is + 92t) ds dt , 

. 

and finally define 

Q* Q* 

Pi(s,t) = (27 t)“^ / [ p\{0i,92)d0id62, 

J-91 J-ei 



(7) 



where, for all 0, 



0* = inf|0i>O: pl(^i,0)>o} , 



z = 1,2. 

Estimator (7) is a version of (1), which always satisfies the positive semi- 
definiteness property (4). 

5. An example

We consider a data set on the weights (kg) of 48 pigs in 9 successive weeks
presented in Diggle et al. (1994), pages 34-46. In figure 1, we plot the
corresponding n = 48 growth curves, where the lines connect the repeated
observations for each pig.

Note that the "time" points, $t_1, \dots, t_9$ say, are evenly spaced, satisfying
the conditions for consistency of estimators (1) and (7). The underlying
stochastic process is apparently nonstationary, with increasing mean function
$\mu(s)$ and variance function $\rho(s,s)$. Again see figure 1.

In order to determine the basic kernel estimator (1) of the covariance
function $\rho$, we have used the product Gaussian kernel function given by

$$K(s,t) = (2\pi)^{-1} e^{-s^2/2} e^{-t^2/2} .$$

Numerical minimization of the cross-validation score (3) has yielded the
bandwidth selector $\hat h = 0.75$. The truncation points chosen to define
estimator $\hat\rho_1$ given by (6) are $S_1 = T_1 = 2.5$ and $S_2 = T_2 = 3.1$. Estimator (7)
has been obtained by standard FFT and numerical integration algorithms.
Figure 1: Weights (kg) of 48 pigs in 9 successive weeks.




The behaviour of estimator $\hat\rho_1(s,t)$ given by (7) is finally displayed in
figure 2, where points (s,t) are from the region [1,9] x [1,9], the bottom
vertex being the point (1,1). Note that the covariance estimate $\hat\rho_1(s,t)$ is
given as a continuous function of the two variables s and t, where
$(s,t) \in [1,9] \times [1,9]$.

The covariance function estimate exhibits an increasing variance function
$\hat\rho_1(s,s)$ (cf. figure 1), as s increases. Set

$$I = \{1, 2, 3, 4, 5, 6, 7, 8, 9\} .$$

We have estimates of the covariance function $\rho(s,t)$ also for points
$(s,t) \in [1,9] \times [1,9]$ which have not been observed, namely, which are different
from $(t_j, t_k) \in I \times I$. For small values of s and t in (s,t), estimator $\hat\rho_1(s,t)$
produces values which are relatively far (with the present example) from the
values $X_{jk}$, for every j, k = 1, ..., 9. For successive values of s and t in
(s,t), estimator $\hat\rho_1(s,t)$ has a better performance, in this sense.

A value of $\hat\rho_1(s,t)$ is an estimate of the covariance between X(s) and
X(t). As shown in figure 2, the covariance point estimates are nearly
symmetrical, that is, it turns out that

$$\hat\rho_1(s,t) \approx \hat\rho_1(t,s) ,$$

for every $(s,t), (t,s) \in [1,9] \times [1,9]$.

Figure 2: Covariance function estimator $\hat\rho_1(s,t)$, $(s,t) \in [1,9] \times [1,9]$.

6. Conclusions 

We have proposed a kernel density estimator for the unknown covariance
function of a stochastic process. The kernel density estimator does not
require stationarity of the process. This point partly explains our choice of
a surface, instead of a curve, to represent the estimated covariance function
of a stochastic process of interest.

It is well known that the so-called positive semi-definiteness property
characterizes a covariance function. An estimator for a covariance function
is typically preferable when it satisfies this property, simply because it is
itself a covariance function. We have applied the procedure by Hall et al.
(1994), in order to make our basic kernel density estimator fulfil the positive
semi-definiteness property.

The proposed estimator is very computationally intensive and may not be
competitive with others. We are thinking about examples where a more
standard estimator could be effective as well. A faster estimation procedure
may probably be achieved by applying a different and more efficient
cross-validation technique.
Abstract: A regression analysis could fail if the sample is actually composed of
several subsamples. We show that the regression function plot is a powerful tool
to detect such a feature in the data. Its behaviour when several subpopulations
are present is investigated in the framework of link-free regression analysis. A
dynamic graphics procedure to detect the coexistence of several subsamples in the
data is proposed.

1. Introduction

In regression analysis, it is usually assumed that a unique model describes
the whole data set. However, sometimes regression data allow several fits due
to the presence of several subpopulations in the sample. Each of them could
require a different model with different parameter values, and they will cause
the failure of an overall analysis. Hence, given a sample for a regression
analysis, the point is to assess whether it is composed of more than one
subsample. The aim of this paper is to give tools to explore whether our data
cloud is actually composed of several data clouds mixed in some way.

Several numerical approaches have been considered to address this problem.
Morgenthaler (1990) proposed a method in the framework of robust analysis,
while some kind of regression tree techniques could be used as well (e.g.
RECPAM, Ciampi (1994)). Both, however, use regression models with a given
link function.

In this paper, we propose a dynamic graphics procedure that performs an
exploratory analysis within the broader environment of the link-free regression
model. As graphical tools, we suggest using the regression function plot along
with the scatterplot matrix. We call regression function plot the plot of the
response against a linear combination of the predictors. In linear regression, it
becomes the plot of the response against the fitted values, while in simple
regression it corresponds to the scatterplot of the response variable against
the predictor. Cook and Weisberg (1994) call it a Summary Plot in the regression
problem, and stress its use as an important tool in regression graphics. This
plot could be used to assess the correct functional form required to describe
the data, and it can highlight the presence of outliers. In this paper, we
propose to use such a plot to detect subsamples in link-free regression analysis.

* Work supported by ex-40% MURST Research Project “Nuovi Metodi di Classificazione e
Analisi dei Dati”.

In Section 2 we present the regression function plot and show how 
subpopulations could appear in its population version. In Section 3 the sample 
version of the plot is investigated, and in Section 4 a dynamic graphics 
procedure to detect subsamples in regression data is proposed. In Section 5 an 
example is presented, while in Section 6 some concluding remarks follow. 



2. Subpopulations in the regression function plot 

Given a response random variable $Y \in \mathbb{R}^1$ and a set of p random predictors
$X \in \mathbb{R}^p$, the goal of a regression analysis is to study the conditional distribution
$F(y \mid X = x)$, and in particular the regression mean function $E(Y \mid X = x)$. A general
model for this distribution, that includes many regression models, could be:

$$F(y \mid X = x) = F(y \mid x'\beta), \qquad (1)$$

with $\beta$ a p x 1 vector of parameters. A corresponding model for the regression
mean function could be:

$$E(y \mid X = x) = E(y \mid x'\beta) = g(x'\beta), \qquad (2)$$

where $g(\cdot)$ is an unknown link function.

This kind of modelling has been called link-free regression analysis because,
unlike the standard regression model, we want to make inference not only on
the parameter vector $\beta$, but on the link function g as well. The standard linear
regression analysis is included in such a framework with $g(x'\beta) = \alpha + x'\beta$.

Given model (1), the distribution of $y \mid X = x$ is the same as the distribution of
$y \mid x'\beta$, for each x. Hence, only one linear combination of the predictors is
needed to extract from X all of the information about the distribution of $y \mid X$.

As a consequence, the plot $\{Y, X'\beta\}$ visualizes the regression mean function
(2), even in the case of more than one predictor. Geometrically, this plot
represents a projection of the data in the plane from which the regression
hyperplane appears as a straight line. Here, by regression hyperplane we mean
the one that "best" fits the data, even if the regression function is not linear.

This feature enables us to visualize the unknown link function and gives 
tools to infer about it - see Cook and Weisberg (1994, Chap. 7, 10) for more 
details. This is the reason why we will call such a plot the “regression function 
plot”. Conditions on the reliability of regression plots investigated by Cook 
(1994) still hold for the regression function plot. 
First, we will describe such a plot for population data (we refer to it as
the population plot in the following), and we will investigate its appearance
when more populations are present.

For the sake of simplicity, let us consider the two populations case. The more
than two populations case can be addressed similarly. Using the above notation,
we will assume respectively the models and the mean functions:

$$F_i(y \mid X = x) = F_i(y \mid x'\beta_i), \qquad i = 1, 2 \qquad (3)$$

$$E_i(y \mid X = x) = E_i(y \mid x'\beta_i) = g_i(x'\beta_i), \qquad i = 1, 2 \qquad (4)$$

where the subscript i = 1, 2 denotes the respective populations and $E_i$ expected
values with respect to the i-th distribution function.

Two possible cases can arise and the way they could appear on the
population regression function plot $\{Y, X'\beta\}$ is different:

i) The two populations parameters are equal up to a proportionality
constant c ($\beta_1 = c\beta_2$).

In this case, (unless $g_1(\cdot) = g_2(\cdot)$ and $\beta_1 = \beta_2$) the two populations will just
appear along two different regression functions in the plot $\{Y, X'\beta\}$.

It is worth noting that as a special case (c = 1, $g_1(\cdot) = \alpha + g_2(\cdot)$) we have shifted
regression functions: two mean functions of the same shape but with different
intercepts will explain the data at hand. If both these regression functions are
linear, the two data clouds will lie along two parallel hyperplanes. In this case,
looking at the $\{Y, X'\beta\}$ plot, an overall hyperplane will be a straight line lying
between the parallel ones. The specific position of the data, of course, will
depend on the population sizes and variances.

ii) The two populations have $\beta_1 \neq c\beta_2$.

In such a case we do not have a unique projection direction to visualize the
regression functions. However, if $\beta_1$ is close to $\beta_2$, for some directions $\beta^*$
between $\beta_1$ and $\beta_2$, the plot $\{Y, X'\beta^*\}$ could give an interesting view of the data,
as the populations can still show a different shape in this projection direction.



3. Subsamples in the regression function plot 

As a matter of fact, to visualize the regression function shape it is enough to
consider a value proportional to $\beta$: the regression function plot $\{Y, X'\beta\}$ shows
the same regression function shape as the plot $\{Y, X'\gamma\beta\}$, where $\gamma$ is any
proportionality constant. Hence, for our purposes, an estimate for $\beta$ is
equivalent to an estimate for $\gamma\beta$. Now, even if the link function is unknown,
under elliptically distributed predictors, the ordinary least squares estimator is
Fisher-consistent for $\gamma\beta$ in model (1) (Li and Duan, 1989).
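
The proportionality result can be illustrated with a small simulation. This is a sketch under assumptions chosen for illustration only: independent standard normal predictors (a special case of elliptical predictors), an exponential link that is treated as unknown when fitting, and hypothetical values for `beta`, the sample size and the noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 3
beta = np.array([1.0, -2.0, 0.5])          # true direction (illustrative)

X = rng.normal(size=(n, p))                # normal, hence elliptical, predictors
index = X @ beta
# Nonlinear link g(u) = exp(0.3 u); the fit below does not use this knowledge.
y = np.exp(0.3 * index) + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an intercept, ignoring the link entirely.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b_hat = coef[1:]

# Compare directions: a cosine near 1 means b_hat is close to gamma * beta.
cosine = b_hat @ beta / (np.linalg.norm(b_hat) * np.linalg.norm(beta))
print("OLS slopes:", np.round(b_hat, 3))
print("cosine between OLS slopes and beta:", round(float(cosine), 4))
```

The slopes are not close to `beta` itself, only to a multiple of it, which is all the regression function plot needs.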
Let y and X be the n x 1 vector and the n x p matrix of observed values,
independently distributed observations on the multivariate random vector (Y, X).
We will denote as $\hat\beta$ an estimate for $\gamma\beta$ in model (1), and the regression
function plot in its sample version will be $\{y, X\hat\beta\}$.

When more than one population is present, the regression function sample
plot $\{y, X\hat\beta\}$ shows different features. From now on, assume that the data are
drawn from two different populations and a unique overall estimate $\hat\beta$ from
our sample is available.

Let us consider first the case when the populations parameters are equal up to
a proportionality constant ($\beta_1 = c\beta_2$).

Using on the whole sample a Fisher-consistent estimator for $\gamma\beta$ in (1) will
yield a suitable estimate $\hat\beta$ for the direction $\beta_1 = c\beta_2$: the two subsamples will
appear along two different regression functions in the plot $\{y, X\hat\beta\}$.

If we have two shifted linear regression functions, the two data clouds will lie
along two parallel hyperplanes. In this case, looking at the $\{y, X\hat\beta\}$ plot, the
overall estimated hyperplane will be a straight line lying between the ones that
could be fitted on the two subsamples. Hence, it is reasonable to expect that
data from the two subsamples will be split approximately, in the $\{y, X\hat\beta\}$ plot,
above and under the single estimated line. Their specific position, of course,
will depend on the unknown subsample sizes and variances.

When the two subsamples come from populations with $\beta_1 \neq c\beta_2$ (case ii)
previously analysed), the $\hat\beta$ estimated from the whole sample will be a
suitable estimate neither for the direction $\beta_1$ nor for $c\beta_2$.

Therefore, the plot $\{y, X\hat\beta\}$ does not give a 'correct' representation of both
the regression functions, as the direction $X\hat\beta$ is just one of the possible
directions in the predictors space. Nevertheless, this plot could give an
interesting view of the data, as the subsamples can still show a different shape
in this projection direction, even if less clearly than in the previous case.
The closer $\beta_1$ is to $c\beta_2$, the better the overall estimate should in some way
discriminate the two regression functions.

To summarize, given a sample derived from two populations, two regression
functions appear more or less clearly in the regression function plot $\{y, X\hat\beta\}$.
However, since it is unknown which observations belong to which subsample, it
could be very difficult to identify the subsamples even approximately.

In order to give tools to detect and identify subsamples, in the next section a
dynamic graphics procedure is proposed.
4. A dynamic graphics procedure

The goal of the proposed procedure is to find data subsets such that their
regression functions are different from each other. Since a few data points with
such a feature can be detected by standard outlier analysis (Chatterjee and Hadi,
1988), the proposed procedure searches for an appropriate broad partition in the
predictors space or along variables not used in the analysis (both referred to as
splitting variables in the following).

From the practical point of view, since we are analyzing real data observed
on the same variables, it seems reasonable that we will have both approximately
the same shape in the regression functions (i.e. $g_1(\cdot) = g_2(\cdot)$, perhaps with
different parameter values), and $\beta_1$ close to $c\beta_2$.

Hence, at first we suggest linearizing the whole sample regression function
by a unique inverse transformation of the link function, as in the standard
link-free regression analysis (Cook and Weisberg 1994, Chap. 10), and then
checking for subsamples using the regression function plot in concert with the
splitting variables scatterplot matrix. Once the data have been divided into
subsamples, two regression analyses on the two subsamples should be performed:
their comparison will evaluate the effectiveness of such a broad partition.

The proposed procedure is then as follows:

Step 1. Get an estimate $\hat\beta$ from the whole data set.

Step 2. Linearize, as far as possible, the regression function by inverse link
function transformation.

Step 3. Display jointly the plot $\{y, X\hat\beta\}$ and the scatterplot matrix of the
splitting variables.

Step 4. Iteratively and interactively: i) Look for clusters and/or patterns in
the plot $\{y, X\hat\beta\}$. ii) If any, use dynamic selection to check if they correspond
to clusters in the splitting variables. iii) Select data subsets by slicing (for
different slice window widths) or brushing in the scatterplot matrix cells.
iv) Look for a pattern of the selected points in the plot $\{y, X\hat\beta\}$.

Step 5. If in the plot $\{y, X\hat\beta\}$ such a pattern is found, try to model it by
dummy variables for the identified subsamples or try to fit different models.
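
A static stand-in for Steps 1, 4 and 5 can be sketched numerically. The procedure above is interactive, so this is only an approximation under stated assumptions: the data are simulated from two shifted linear regressions, the hypothetical splitting variable `z` is sliced once at its median instead of being brushed dynamically, and the helper name `ols` is introduced for illustration.

```python
import numpy as np

def ols(A, y):
    """Least squares coefficients for design matrix A."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
z = rng.uniform(size=n)                  # hypothetical splitting variable
group = z > 0.5                          # hidden subsample membership
y = X @ np.array([1.0, 2.0]) + 3.0 * group + rng.normal(scale=0.5, size=n)

# Step 1: overall estimate from the whole data set.
A = np.column_stack([np.ones(n), X])
coef = ols(A, y)
resid = y - A @ coef                     # position above/below the overall line

# Step 4 (static version): slice the splitting variable at its median and
# compare where the two slices fall relative to the single fitted line.
upper = z > np.median(z)
gap = resid[upper].mean() - resid[~upper].mean()

# Step 5: model the detected pattern with a dummy variable.
A2 = np.column_stack([A, upper.astype(float)])
coef2 = ols(A2, y)
rss1 = float(((y - A @ coef) ** 2).sum())
rss2 = float(((y - A2 @ coef2) ** 2).sum())

print("mean residual gap between slices:", round(float(gap), 2))
print("residual SS without / with dummy:", round(rss1, 1), "/", round(rss2, 1))
print("estimated shift:", round(float(coef2[-1]), 2))
```

The two slices fall on opposite sides of the overall line, which is the pattern Step 4 looks for, and the dummy variable then recovers a shift close to the simulated one.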



5. An example 

As an example for the proposed procedure, we will use a data set from the
Minnesota Dep. of Children, Families and Learning referring to 46 schools in
Minnesota for 1994-1995 (Source: Minneapolis Star and Tribune, March
18, 1996). The response variable is the percent of graduating students who
attend four year college (P4Y), while the predictors are characteristics of the
school: score at the ACT test, percent of minority students, cost of the student,
and percent of students with free lunch (Pfree-lunch). Ordinary least squares
estimation and link function inverse transformation were performed. As a result,
the transformed P4Y is linearly and positively related with the first three
predictors and negatively with the fourth. Numerical results are given in Table 1.

Table 1: Minnesota Schools data. Regression.

Coefficient Estimates                                     Summary Analysis of Variance Table

                    Estimate     Std. Error   t-value     Source       df   SS           MS
Constant           -4972.39      3830.84      -1.298      Regression    4   73226848.    18306712.
ACTscore             339.227      164.798      2.058      Residual     41   25478270.      621421.
Cost                   0.725155     0.259484   2.795      R^2: 0.741875
log[Pfree_lunch]   -1492.20       219.716     -6.791
log[PMinority]       364.570      179.583      2.030      F: 29.46   p-value: 0.0000



The p-value for the F test is practically equal to 0. However, the regression
function plot $\{y, X\hat\beta\}$ shows an S-shaped pattern in the data (Fig. 1(a), top): the
model does not completely describe the data. A further power transformation
does not work, while a more complex nonlinear function $g(\cdot)$ is not worthwhile,
as it could require too many parameters in the model.


Fig. 1: Minnesota Schools data. Predictors scatterplot matrix and plot $\{y, X\hat\beta\}$.


Instead, when our procedure is used, by dynamic selection (Step 4, i-ii), we
note that the bottom of the S-shape in the plot $\{y, X\hat\beta\}$ corresponds to values in
the upper half range of the variable Pfree-lunch (Fig. 1(a)). Hence, according to
our procedure (Step 4, iii-iv), we check this suspected structure and note that
higher values of this variable correspond to a pattern in the plot $\{y, X\hat\beta\}$
(Fig. 1(b)). This pattern could then be partially explained by two different
models for lower and higher values of the variable Pfree-lunch, a variable
revealing the economic and social level of the students in the school. Refitting
the model separately for the subsets of data with lower and higher values of
Pfree-lunch, we found that the dependence of P4Y on all the predictors is
stronger when the level of Pfree-lunch is low, whereas for higher values of
Pfree-lunch the F-test for regression yields a p-value of 0.1082 (Table 2).



Table 2: Minnesota Schools data. Two regressions for Pfree-lunch values.

                    Pfree-lunch lower value, 23 cases      Pfree-lunch higher value, 23 cases
                    Estimate     Std. Error   t-value      Estimate     Std. Error   t-value
Constant           -18079.4      5310.89      -3.404        3375.46     5270.19       0.640
ACTscore              898.663     246.780      3.642          26.6400    206.643      0.129
Cost                    0.718061    0.332047   2.163           0.429684    0.413189   1.040
log[Pfree_lunch]    -1298.91      269.546     -4.819       -1482.39      646.305     -2.294
log[PMinority]        450.179     209.713      2.147         340.745     301.846      1.129
R^2:                    0.88077                                0.329808

Anal. of Variance:
                    Source       df   SS           MS          Source       df   SS           MS
                    Regression    4   59074818.    14768705.   Regression    4    5429068.    1357267.
                    Residual     18    7997004.      444278.   Residual     18   11032255.     612903.
                    F: 33.24     p-value: 0.0000               F: 2.21      p-value: 0.1082
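
The summary figures reported for the overall fit and for the two subsample fits follow from the sums of squares by simple arithmetic, which the short sketch below reproduces. The sums of squares and degrees of freedom are those printed in Tables 1 and 2; the function name `anova_summary` is hypothetical, and p-values are not recomputed here.

```python
def anova_summary(ss_reg, df_reg, ss_res, df_res):
    """F statistic and R^2 from regression and residual sums of squares."""
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return {"F": ms_reg / ms_res, "R2": ss_reg / (ss_reg + ss_res)}

# Sums of squares and degrees of freedom as printed in Tables 1 and 2.
whole = anova_summary(73226848.0, 4, 25478270.0, 41)
low = anova_summary(59074818.0, 4, 7997004.0, 18)
high = anova_summary(5429068.0, 4, 11032255.0, 18)

for label, res in [("whole sample", whole),
                   ("lower Pfree-lunch", low),
                   ("higher Pfree-lunch", high)]:
    print(f"{label:20s} F = {res['F']:.2f}   R^2 = {res['R2']:.6f}")
```

The contrast between the two subsample F statistics is what drives the interpretation of the two fits.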



At first glance, it seems that the percent of graduating students who attend
college does not depend globally on the recorded characteristics of the school if
the economic and social level of the students is low. In conclusion, the
coexistence of two subpopulations accounts for the contradictory presence in
the whole sample of a p-value of 0 for the F test and a nonlinear pattern in the
regression function plot.



6. Remarks 

We have introduced the regression function plot as a tool to detect 
subsamples in link-ffee regression analysis. As a particular case, the proposed 
procedure -omitting Step 2- works in standard linear regression analysis as well. 

For visual enhancement, at a first moment the plot {XP , y} could be used in 
the suggested procedure, instead of the regression function plot. 

Only subsamples with different regression mean function $E(y \mid X) = E(y \mid x'\beta)$ 
have been considered. However, the ideas extend straightforwardly to the 
other regression moment functions, such as the regression variance function 
$\mathrm{Var}(y \mid x'\beta)$. 

It is worth noting the case when the subsample regression functions do not 
depend on the same predictor variables. Let $X_i$ be a subset of $X$. For the $i$-th 
subsample we can have $E_i(y \mid X) = E_i(y \mid X_i) = E_i(y \mid x'\beta_i)$ with some values 
in $\beta_i$ equal to zero. In such a case $\beta_1 \neq c\beta_2$, unless $c$ is close to zero and then 
$E_i(y \mid X) = E_i(y)$. For our procedure to work, we only need [...] 

In the case we have more than two subsamples in the data, our procedure 
could work as well. On the other hand, to investigate whether any subsample is 
in turn composed of further 'sub-subsamples', iterating the procedure is 
recommended. 

Acknowledgments. Thanks are due to J. Antoch and G. Galmacci for many 
helpful comments on an earlier version of this paper. 

Abstract: In this paper we derive the asymptotic posterior distribution, 
in a conjugate analysis, for the marginal and partial correlation coefficients 
in a graphical Gaussian model. An example of prior to posterior analysis is 
given and the problem of the specification of the hyper parameters discussed. 

1. Introduction 

Recent work established a class of statistical models, known as graphical 
models, that exploit the close relationship between conditional independence 
and separation in undirected graphs (see Lauritzen, 1996). 

The conditional independence structure of a multivariate Normal distribution 
is dictated by the zero pattern of its concentration matrix $\Sigma^{-1}$. The 
use of the concentration matrix in the parametrisation of the normal 
distribution has many advantages; however, non-zero elements of $\Sigma^{-1}$ are 
difficult to interpret since they are unnormalised quantities. Consequently, 
when the interest is on the strength of the association structure, these 
parameters have to be transformed to obtain partial correlation coefficients. 

In this paper we apply the results of Roverato and Whittaker (1996, 1998) 
to derive the asymptotic posterior distributions, in a Bayesian conjugate 
analysis, of the marginal and of the partial correlation coefficients. We give 
an example of prior to posterior analysis based on real data where we discuss 
the difficulties concerned with the specification of the hyper parameters. 

The notation is given in Section 2. In Section 3 we present some basic 
theory relating to graphical Gaussian models and in Section 4 we describe 
the HIV data used in the application. In Section 5 we derive the required 
asymptotic distributions. Finally in Section 6 we carry out the application 
and discuss the problem of the specification of the hyper parameters. 

2. Notation 

Let $V$ be a finite set with $|V| = p$, and let $\Gamma$ be a $p \times p$ symmetric 
invertible matrix. The rows and columns of $\Gamma$ are indexed by the elements of 
$V$, so that $\Gamma$ itself is indexed by $V \times V$. When $V = \{1, \ldots, p\}$, $\Gamma$ is 
indexed by row and column numbers. 

The Isserlis matrix of $\Gamma$, $\mathrm{Iss}(\Gamma)$, (Isserlis, 1918; Roverato and Whittaker, 
1998) is the symmetric matrix indexed by $W \times W$ where $W = \{(i,j) : i,j \in V, i \le j\}$, 
with elements $\{\mathrm{Iss}(\Gamma)\}_{(i,j),(r,s)} = \gamma_{ir}\gamma_{js} + \gamma_{is}\gamma_{jr}$. 
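
To make the definition concrete, the following minimal sketch builds the Isserlis matrix of a small symmetric matrix directly from the element formula above. It assumes that $W$ contains the pairs with $i \le j$ in lexicographic order and uses 0-based indices; the function name `isserlis` is hypothetical.

```python
from itertools import combinations_with_replacement


def isserlis(gamma):
    """Return (W, Iss) for a symmetric matrix given as a list of lists."""
    p = len(gamma)
    # W holds the index pairs (i, j) with i <= j, in lexicographic order.
    W = list(combinations_with_replacement(range(p), 2))
    # Element ((i, j), (r, s)) is gamma_ir * gamma_js + gamma_is * gamma_jr.
    iss = [[gamma[i][r] * gamma[j][s] + gamma[i][s] * gamma[j][r]
            for (r, s) in W]
           for (i, j) in W]
    return W, iss


gamma = [[2.0, 1.0],
         [1.0, 3.0]]
W, iss = isserlis(gamma)
for pair, row in zip(W, iss):
    print(pair, row)
```

For a $p \times p$ matrix the result has $p(p+1)/2$ rows and columns, one for each distinct element of the symmetric matrix.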

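The graph theory requisite for graphical models may be found in Lauritzen 
(1996). We use the convention that, for an arbitrary undirected graph 
$G = (V, \mathcal{V})$ where $V$ is the vertex set and $\mathcal{V}$ is the set of edges, for all $i \in V$ 
the pair $(i,i)$ is included in $\mathcal{V}$ and that if $(i,j) \in \mathcal{V}$ then $i \le j$. The set $W$ 
is therefore the edge set of the complete graph and we denote by 
$\bar{\mathcal{V}} = W \setminus \mathcal{V}$ the set of edges not in $G$. 

EXAMPLE 1 With $V = \{1, 2, 3\}$ and graph $G$ given by the chain 1 - 2 - 3, 
the edge set is $\mathcal{V} = \{(1,1), (2,2), (3,3), (1,2), (2,3)\}$ while $\bar{\mathcal{V}} = \{(1,3)\}$. □ 

For any undirected graph $G = (V, \mathcal{V})$ the pair $(\mathcal{V}, \bar{\mathcal{V}})$ is a partition of $W$. 
To this correspond the submatrices $\mathrm{Iss}(\Gamma)_{\mathcal{V}\mathcal{V}}$, $\mathrm{Iss}(\Gamma)_{\bar{\mathcal{V}}\bar{\mathcal{V}}}$ and $\mathrm{Iss}(\Gamma)_{\mathcal{V}\bar{\mathcal{V}}}$, as well 
as the partial matrix 
$\mathrm{Iss}(\Gamma)_{\mathcal{V}\mathcal{V}|\bar{\mathcal{V}}} = \mathrm{Iss}(\Gamma)_{\mathcal{V}\mathcal{V}} - \mathrm{Iss}(\Gamma)_{\mathcal{V}\bar{\mathcal{V}}} [\mathrm{Iss}(\Gamma)_{\bar{\mathcal{V}}\bar{\mathcal{V}}}]^{-1} \mathrm{Iss}(\Gamma)_{\bar{\mathcal{V}}\mathcal{V}}$. 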

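For a set $C \subseteq W$ we define the $C$-incomplete matrix $\Gamma^{C}$ as the symmetrised 
matrix indexed by $V \times V$ with elements $\{\gamma_{ij}\}$ for all $(i,j) \in C$, and with 
the remaining elements unspecified. In Example 1 above the incomplete 
matrices corresponding to the sets $\mathcal{V}$ and $\bar{\mathcal{V}}$ are respectively 

$$\Gamma^{\mathcal{V}} = \begin{pmatrix} \gamma_{11} & \gamma_{12} & * \\ \gamma_{21} & \gamma_{22} & \gamma_{23} \\ * & \gamma_{32} & \gamma_{33} \end{pmatrix} \quad \text{and} \quad \Gamma^{\bar{\mathcal{V}}} = \begin{pmatrix} * & * & \gamma_{13} \\ * & * & * \\ \gamma_{31} & * & * \end{pmatrix}$$

where asterisks denote unspecified elements. The matrix $\Gamma^{-C}$ is a shorthand 
for $(\Gamma^{-1})^{C}$. If it is possible to fill an incomplete matrix $\Gamma^{C}$ to obtain a (full) 
positive definite matrix we say that $\Gamma^{C}$ admits a positive completion. 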

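Let $\Gamma^{\mathcal{V}}$ be a $\mathcal{V}$-incomplete matrix, with $G = (V, \mathcal{V})$, which admits a 
positive completion. We say that $\Gamma_G$ is the completion of $\Gamma^{\mathcal{V}}$ if it is the 
unique positive definite matrix such that 

$$(\Gamma_G)^{\mathcal{V}} = \Gamma^{\mathcal{V}} \quad \text{and} \quad \{\Gamma_G^{-1}\}_{ij} = 0 \ \text{for all } (i,j) \in \bar{\mathcal{V}}. \qquad (1)$$

See Grone et al. (1984) for a proof of the existence and uniqueness of such 
a matrix. 

We denote by $\mathrm{diag}(\Gamma^{\mathcal{V}})_{\mathcal{V}\mathcal{V}}$ the matrix indexed by $\mathcal{V} \times \mathcal{V}$ with the distinct 
specified elements of $\Gamma^{\mathcal{V}}$ in the main diagonal and zero elsewhere. 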

For an undirected graph $G = (V, \mathcal{V})$, we denote by $\mathcal{M}^{*}(G)$ the set of all 
$\mathcal{V}$-incomplete matrices and by $\mathcal{M}^{+}(G)$ the set of all $\mathcal{V}$-incomplete matrices 
that admit positive completion. Furthermore we denote by $\mathcal{M}_0(G)$ the set 
of all symmetric matrices indexed by $V \times V$ with element $(i,j)$ equal to zero 
whenever $(i,j) \in \bar{\mathcal{V}}$. The intersection of $\mathcal{M}_0(G)$ with the set of all positive 
definite matrices is denoted by $\mathcal{M}_0^{+}(G)$. 

The trace of the product of two symmetric matrices is $\mathrm{tr}(\Phi\Gamma) = \sum_{i,j} \phi_{ij}\gamma_{ij}$. 
If $\Phi \in \mathcal{M}_0(G)$ only the specified elements of $\Gamma^{\mathcal{V}}$ enter into this sum, and so 
we can write $\mathrm{tr}(\Phi\Gamma) = \mathrm{tr}(\Phi\Gamma^{\mathcal{V}})$. 



3. Graphical Gaussian models 



We consider graphical models for a random vector $X_V$ with distribution 
$P_V$ on an undirected graph $G = (V, \mathcal{V})$. We say that the distribution 
$P_V$ is Markov with respect to $G$ if $X_A$ is independent of $X_B$ given $X_S$, 
$X_A \perp\!\!\!\perp X_B \mid X_S \; [P_V]$, whenever $S$ separates $A$ from $B$ in $G$. (For the Gaussian 
case that interests us, the global, local and pairwise Markov properties are 
identical, see Lauritzen 1996.) 

A graphical Gaussian model is defined as the intersection of the set of 
Markov distributions relative to $G$ and the set of $p$-variate normal 
distributions with mean equal to zero and variance matrix $\Sigma$, which we assume 
positive definite. An off-diagonal element $\sigma^{ij}$ of $\Sigma^{-1}$ is zero if and only if 
$X_i \perp\!\!\!\perp X_j \mid X_{V \setminus \{i,j\}} \; [P_V]$. Thus in graphical Gaussian models the pairwise 
conditional independence structure of $X_V$ is dictated by the zero structure of 
$\Sigma^{-1}$, so that $\Sigma = \Sigma_G$ is a $G$-completed matrix. 

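A natural measure of the interaction represented by the edge $(i,j)$ of the 
graph is given by the partial correlation coefficient 

$$\rho_{ij \cdot V \setminus \{i,j\}} = \frac{-\sigma^{ij}}{\sqrt{\sigma^{ii}\sigma^{jj}}} \qquad (2)$$

which is zero if and only if $(i,j) \notin \mathcal{V}$. 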
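
The relation between the concentration matrix and the partial correlations can be illustrated with a short sketch. It takes a concentration matrix directly as input (so no matrix inversion is needed) and applies the standard formula in which each off-diagonal element is negated and scaled by the corresponding diagonal elements. The function name and the example matrix are assumptions made for illustration only; the example matrix has a zero in position (1,3), matching the missing edge of Example 1.

```python
import math


def partial_correlations(K):
    """Partial correlations from a concentration matrix K (list of lists).

    Returns a dict keyed by 0-based pairs (i, j) with i < j.
    """
    p = len(K)
    return {(i, j): -K[i][j] / math.sqrt(K[i][i] * K[j][j])
            for i in range(p) for j in range(i + 1, p)}


# Hypothetical concentration matrix for the chain graph 1 - 2 - 3.
K = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]

rho = partial_correlations(K)
print(rho)
```

The pair whose concentration element is zero gets a zero partial correlation, which is the algebraic counterpart of the missing edge in the graph.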

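The moment parameter of the distribution is $\Sigma^{\mathcal{V}} \in \mathcal{M}^{+}(G)$ while the 
canonical parameter is $\Sigma^{-1} \in \mathcal{M}_0^{+}(G)$, the inverse of the completion of $\Sigma^{\mathcal{V}}$. 
The likelihood function of $\Sigma^{-1}$ based on a random sample $X^{(n)} = (X^1, \ldots, X^n)$ 
from $P_V$ is 

$$L(\Sigma^{-1}) \propto \exp\left\{ -\frac{n}{2}\,\mathrm{tr}(\Sigma^{-1}S) + \frac{n}{2}\log|\Sigma^{-1}| \right\}, \qquad (3)$$

where $S$ is the sample variance matrix. In our notation, the maximum 
likelihood estimate $\hat{\Sigma}^{-1}$ (Speed and Kiiveri, 1986) is the inverse of the 
completion of $S^{\mathcal{V}}$. 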



4. The HIV study 



Table 1 presents some summary statistics for six variables measured in Genoa 
and Padua paediatric hospitals on 107 three-month-old babies. These data 
come from a larger Italian study investigating early diagnosis of HIV 
infection in children from HIV positive mothers. The variables are related to 
various measures on blood and its components: $X_1$ and $X_2$ immunoglobulin G 
and A, respectively; $X_4$ the platelet count; $X_3$, $X_5$ lymphocyte B and T4, 
respectively; and $X_6$ the T4/T8 lymphocyte ratio. (For a detailed description 
of these data see Boccuzzo, 1991.) 

Table 1: Summary statistics for the HIV data: sample variances (main 
diagonal), correlations (lower triangle) and partial correlations (upper triangle). 

        X1         X2         X3           X4         X5           X6
X1      8.8374     0.479      -0.043       -0.033     0.356        -0.236
X2      0.483      0.1919     0.068        -0.084     -0.224       -0.110
X3      0.220      0.057      8924231.9    0.085      0.552        -0.330
X4      -0.040     -0.133     0.149        20392.4    0.091        0.013
X5      0.253      -0.124     0.523        0.179      1952795.2    0.384
X6      -0.276     -0.314     -0.183       0.064      0.213        1.378

Discussion with the experts running the study suggests the presence of 
a strong association between variables $X_1$, $X_2$ and between variables $X_3$, 
$X_5$, $X_6$, together with an association structure of these variables compatible 
with the graph of Figure 1. 

An essential quantity to be computed in a Bayesian analysis of the pro- 
posed graphical model is a posterior distribution for the relevant association 
coefficients which, in the Gaussian case, are the marginal and the partial 
correlations corresponding to the edges of the graph. In the next Section we 
consider an asymptotic approach to this problem. 



5. Asymptotic distributions 

Exact prior to posterior analysis for the correlation coefficients of an 
arbitrary graphical Gaussian model still poses difficulties concerned with the 
algebraic derivation of a tractable posterior distribution. In this Section we 
derive the asymptotic distributions for the marginal and the partial 
correlations in a conjugate analysis. 

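The standard conjugate prior density for $\Sigma^{-1}$ with respect to the product 
of Lebesgue measures on the diagonal and non-zero super-diagonal elements 
of $\Sigma^{-1}$ can be deduced from (3) (see Bernardo and Smith, 1994, p.269) and 
has form 

$$\pi(\Sigma^{-1} \mid A^{\mathcal{V}}, h) \propto \exp\left\{ -\frac{h}{2}\,\mathrm{tr}(\Sigma^{-1}A^{\mathcal{V}}) + \frac{h}{2}\log|\Sigma^{-1}| \right\}, \qquad (4)$$

for $\Sigma^{-1} \in \mathcal{M}_0^{+}(G)$. The hyper parameters are an incomplete matrix 
$A^{\mathcal{V}} \in \mathcal{M}^{+}(G)$ and a positive constant $h$. The posterior distribution has density 
function of the same form with hyper parameters $T^{\mathcal{V}} = (hA^{\mathcal{V}} + nS^{\mathcal{V}})/(h + n)$ 
and $m = h + n$. By the similarity between (3) and (4) it can be shown 
that the posterior density has maximum at $T_G^{-1}$, the inverse of the 
completion of $T^{\mathcal{V}}$. 

Let $R^{\mathcal{V}}$ be the incomplete matrix with main diagonal equal to that of 
$\Sigma^{\mathcal{V}}$ and with marginal correlations in the specified off-diagonal positions. 
Clearly $R^{\mathcal{V}}$ can be written as a (bijective) function of $\Sigma^{\mathcal{V}}$, $R^{\mathcal{V}} = g(\Sigma^{\mathcal{V}})$, so 
that, by (2), $-g(\Sigma^{-\mathcal{V}})$ is the incomplete matrix with main diagonal 
equal to that of $-\Sigma^{-1}$ and partial correlations in the specified off-diagonal 
positions. Roverato and Whittaker (1998) showed that the asymptotic 
posterior distributions for $\Sigma^{-\mathcal{V}}$ and $\Sigma^{\mathcal{V}}$ are 

$$\Sigma^{-\mathcal{V}} \overset{a}{\sim} N\left(T_G^{-\mathcal{V}},\; m^{-1}\,\mathrm{Iss}(T_G^{-1})_{\mathcal{V}\mathcal{V}|\bar{\mathcal{V}}}\right) \quad \text{and} \quad \Sigma^{\mathcal{V}} \overset{a}{\sim} N\left(T^{\mathcal{V}},\; m^{-1}\,\mathrm{Iss}(T_G)_{\mathcal{V}\mathcal{V}}\right) \qquad (5)$$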

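respectively. The asymptotic posterior distribution of the transformed 
parameters can be obtained by applying standard delta-type methods (see 
Bernardo and Smith, 1994, p.295). The transformation $g(\cdot)$ can be shown 
to have Jacobian 

$$J(\Sigma^{\mathcal{V}})_{\mathcal{V}\mathcal{V}} = \frac{\partial g(\Sigma^{\mathcal{V}})}{\partial \Sigma^{\mathcal{V}}} = \mathrm{diag}(g(\Sigma^{\mathcal{V}}))_{\mathcal{V}\mathcal{V}} \, H_{\mathcal{V}\mathcal{V}} \, \mathrm{diag}(\Sigma^{\mathcal{V}})^{-1}_{\mathcal{V}\mathcal{V}}, \qquad (6)$$

where $H_{\mathcal{V}\mathcal{V}}$ is the matrix indexed by $\mathcal{V} \times \mathcal{V}$ with elements $\{H_{\mathcal{V}\mathcal{V}}\}_{(i,j),(r,s)}$ equal 
to 1 if $i = r$ and $j = s$, to $-1/2$ if either $i = r = s \neq j$ or $j = r = s \neq i$ 
and 0 elsewhere. As usual, in taking the derivative in (6) the denominator 
is assumed to be a row vector, the numerator a column vector and the 
off-diagonal elements are considered only once. Applying $J(\cdot)$, evaluated at the 
mode of the posterior density, to (5) we obtain 

$$-g(\Sigma^{-\mathcal{V}}) \overset{a}{\sim} N\left(-g(T_G^{-\mathcal{V}}),\; m^{-1} J(T_G^{-\mathcal{V}})_{\mathcal{V}\mathcal{V}} \, \mathrm{Iss}(T_G^{-1})_{\mathcal{V}\mathcal{V}|\bar{\mathcal{V}}} \, J(T_G^{-\mathcal{V}})'_{\mathcal{V}\mathcal{V}}\right) \qquad (7)$$

and 

$$R^{\mathcal{V}} \overset{a}{\sim} N\left(g(T^{\mathcal{V}}),\; m^{-1} J(T^{\mathcal{V}})_{\mathcal{V}\mathcal{V}} \, \mathrm{Iss}(T_G)_{\mathcal{V}\mathcal{V}} \, J(T^{\mathcal{V}})'_{\mathcal{V}\mathcal{V}}\right) \qquad (8)$$

which are the required asymptotic distributions. 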
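
The structure of the Jacobian in (6) can be checked numerically in the smallest possible case. The sketch below assumes a saturated model with two variables, orders the index pairs as (1,1), (1,2), (2,2), and compares the product form (diagonal of transformed values, times H, times inverse diagonal of the original values) with a central finite-difference approximation. All function names and the numerical values are assumptions of this example, and the product form requires a non-zero covariance.

```python
import math


def g(s11, s12, s22):
    """Keep the variances and replace the covariance by the correlation."""
    return [s11, s12 / math.sqrt(s11 * s22), s22]


def jacobian(s11, s12, s22):
    """Product form diag(g) * H * diag(sigma)^{-1}, pairs (1,1), (1,2), (2,2)."""
    gv = g(s11, s12, s22)
    sv = [s11, s12, s22]
    # H has 1 on the diagonal and -1/2 linking the correlation to each variance.
    H = [[1.0, 0.0, 0.0],
         [-0.5, 1.0, -0.5],
         [0.0, 0.0, 1.0]]
    return [[gv[a] * H[a][b] / sv[b] for b in range(3)] for a in range(3)]


def numeric_jacobian(s11, s12, s22, eps=1e-6):
    """Central finite differences of g, one input coordinate at a time."""
    base = [s11, s12, s22]
    J = [[0.0] * 3 for _ in range(3)]
    for b in range(3):
        up, dn = base[:], base[:]
        up[b] += eps
        dn[b] -= eps
        gu, gd = g(*up), g(*dn)
        for a in range(3):
            J[a][b] = (gu[a] - gd[a]) / (2 * eps)
    return J


Ja = jacobian(4.0, 1.2, 9.0)
Jn = numeric_jacobian(4.0, 1.2, 9.0)
max_diff = max(abs(Ja[a][b] - Jn[a][b]) for a in range(3) for b in range(3))
print("largest discrepancy:", max_diff)
```

The agreement between the two matrices illustrates why the entries of H are 1 and -1/2: the correlation is linear in the covariance and depends on each variance through an inverse square root.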



6. Application 

In this Section we carry out a prior to posterior analysis of the HIV data 
based on the results of Section 5. 

A critical point in the conjugate Bayesian analysis of the graphical 
Gaussian model is the specification of the prior hyper parameters $h$ and $A^{\mathcal{V}}$. The 
posterior distribution is very sensitive to different prior specifications. 
Furthermore, note that the number of hyper parameters exceeds the number of 
parameters. 

The positive constant $h$ can be thought of as the number of imaginary 
data points establishing the prior belief; consequently, in the absence of 
genuine prior expert views it should be small. 

The incomplete matrix $A^{\mathcal{V}}$ reflects the prior belief on the real value of 
$\Sigma^{\mathcal{V}}$. Dealing with the saturated model, $A^{\mathcal{V}} = A$, some authors (Chen, 1979; 
Dickey et al., 1985) proposed to reduce the number of hyper parameters by 
constraining $A$ to have a lower-dimensional structure built from a matrix $\Phi$ 
and a diagonal matrix $\Delta$, where $\Phi = (1 - \phi)I + \phi 1_p 1_p'$ has intraclass 
correlation structure (with $-1/(p-1) < \phi < 1$ so as to assure $\Phi > 0$) and 
$\Delta = \mathrm{diag}(\delta_1, \ldots, \delta_p)$. 

We remark that a similar choice of $A^{\mathcal{V}}$ induces a systematic bias in the 
mean of the asymptotic posterior distributions here considered. To see this, 
consider two sample correlations $r_{ij}$ and $r_{rs}$ such that $r_{ij} > 0$ and $r_{ij} = -r_{rs}$. 
In this case the evidence provided by the data is of equal strength in the 
associations relative to $\rho_{ij}$ and $\rho_{rs}$. In such a situation it is desirable, when 
not otherwise specified in the prior, that this characteristic is maintained in 
the posterior distribution. However, it can be easily checked in $g(T^{\mathcal{V}})$ that any 
choice of $\phi > 0$ leads to an asymptotic posterior mean of $\rho_{rs}$ closer to zero than 
that of $\rho_{ij}$. Furthermore, a similar behaviour can be empirically observed 
in $-g(T_G^{-\mathcal{V}})$. Therefore an intraclass correlation structure of $A^{\mathcal{V}}$ implies 
an asymmetrical inference depending on the signs of the single correlation 
coefficients. Note that setting $\phi$ to 0 would overcome this difficulty but at 
the price of a strong prior information of independence between all variables, 
which is seldom justified. 

Our proposed solution to this problem is to set the specified $(i,j)$-element 
($i \neq j$) of $\Phi^{\mathcal{V}}$ to $\phi \times \mathrm{sign}(r_{ij})$. Although this is not a pure Bayesian approach 
to the problem, it keeps the hyper parameter dimension low without 
introducing a systematic bias in the analysis. In order to make the prior 
distribution not too informative, for the HIV data we set $h = 1$ and $\phi = 0.3$ 
(this is an admissible value for $\phi$ since $\Phi^{\mathcal{V}}$ has positive completion). The 
hyper parameter $\Delta$ is set to diag(10, 10°, 10^, 10°, 10^, 10). 

The resulting asymptotic marginal posterior densities are presented in 
Figure 1. The given marginal distributions are completely specified by their 
mean and variance values. Nevertheless the plot of the posterior densities 
in the $[-1, 1]$ interval is useful since it allows an immediate visualisation of 
the densities' behaviour around zero. For instance, it can be noted that all 
the correlations involving variable $X_4$ (fourth row and column in the picture) 
have high density values at zero. A more detailed analysis showed that all 
of the four corresponding 90% confidence intervals include zero, suggesting 
that the model may be overparametrised. 

Figure 1: Independence graph of the hypothesised model for HIV diagnostic 
data and marginal asymptotic posterior densities for variances (main 
diagonal), partial correlations (upper triangle) and marginal correlations (lower 
triangle). 

7. Conclusions 

Graphical models are specified by a set of pairwise conditional independencies 
and this leads to a parametrisation in terms of partial association 
coefficients. Nevertheless, the marginal independence pattern is also relevant 
to the comprehension of the problem under consideration. Figure 1, including 
both marginal and partial correlations, seems to be an effective device 
to summarise the data structure under a graphical Gaussian model. 

The Splus functions relative to the work of this paper are available from 
the author. 

Acknowledgements: We are grateful to G. Boccuzzo for making the HIV 
data available and for helpful discussions concerning its analysis. 

Abstract: In the Italian financial market, stock fluctuations are highly 
dependent on political and economic events. For this reason, any realistic 
forecast should consider this kind of information. In this paper we show a way 
to include economic and political events in order to forecast a financial time 
series. Then we applied neural networks, econometric analysis and some recent 
non-parametric regression models to empirical data observed over a period of 
61 weeks. The respective performances of the different approaches were then 
compared. 

Keywords: Forecast, Neural Networks, Non-parametric Models, Financial 
Time series. 



1. Introduction 

In the Italian financial stock market, fluctuations are highly dependent on 
political and economic events. This feature has precluded the construction of 
reliable forecasting models. 

In this paper we analyze the prices of listed shares traded on the Italian Stock 
Exchange (Borsa Valori di Milano), with the aim of forecasting the 
performance of the Italian Stock Index-MIB (basis 3/1/94=1000). 

The dependence of stock prices on political and economic events was 
verified at first glance by scanning the MIB series. Large changes in stock 
prices occur in correspondence to the most relevant events of Italian political 
life. 

Such events could be of a political (e.g. the fall of a government) or a 
political-economic nature (e.g. a reduction in the discount rate). Furthermore, 
they can be classified as national or international events. Usually they have an 
immediate effect on stock prices and most traders use such news to speculate 
on the market. 

Such considerations suggest that a valid prediction of the MIB should be 
obtained by adding current political and economic information to previously 
observed values in the series (see e.g. Tivegna, 1996). 

The data we used refer to the period from April 5, 1994 to June 31, 1995 (61 
weeks). To avoid anomalies associated with the weekly opening and closing of 
the market we considered the Tuesday MIB quotation for each week. 

As predictive variables we took into account: the 10 years BTP Futures, the 
Lira/Dollar exchange rate, the Lira/German Mark exchange rate and the mean 
of the variation rate of the main foreign stock indices (London, N.Y., Frankfurt, 
Tokyo). 

It is true that more detailed data, considering daily indices over a longer 
period of time, could have been analyzed. On the other hand, the aim of this 
paper is to evaluate the feasibility of using qualitative exogenous information 
jointly with modern non-parametric methods. The results obtained can be a 
useful starting point towards the construction of a more complex analysis. 



2. Indices of political and economic events 

Qualitative information on national and international economic events was 
derived by skimming the front page of the most important Italian economic 
newspaper “Sole 24 ore”. This newspaper provides a reliable review of the 
main economic and political events in Italy. 

Qualitative indices were constructed by ranking each political or economic 
news item reported on the front page according to previously defined criteria. 
The order in the ranking was based only on the newspaper's point of view, to 
obtain data independent of the observer's opinion. 

Each news item was first classified according to 4 categories: 

1) domestic politics, 2) international politics, 3) domestic economics, 4) 
international economics. 

Then the items were scored according to: 

i) position on the front page: 4=top of the page; 2=center; 1=bottom; 

ii) space taken up on the page, measured as the number of columns occupied 
(score 0-2); 

iii) emphasis given to its title (score 0-4). 

The three different scores were added and the resulting score was multiplied by 
a coefficient (+1, 0, -1) according to the journalist's opinion on the reported 
event (positive, neutral or negative). 

Weekly indices were obtained for each news category by summing up the 
scores reported for each day from Tuesday through to Monday of the previous 
week (omitting Saturdays and Sundays). 
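
The scoring rules above can be summarised in a short sketch. The category labels, the tuple layout of a news item and the function names are assumptions introduced here for illustration; only the scoring scheme itself (position 4/2/1, columns 0-2, emphasis 0-4, sign +1/0/-1, weekly sum per category) is taken from the text.

```python
from collections import defaultdict

# Position scores as described in the text.
POSITION_SCORE = {"top": 4, "center": 2, "bottom": 1}


def item_score(position, columns, emphasis, opinion):
    """Score of one news item: (position + columns + emphasis) * opinion."""
    if opinion not in (-1, 0, 1):
        raise ValueError("opinion must be -1, 0 or +1")
    return (POSITION_SCORE[position] + columns + emphasis) * opinion


def weekly_indices(items):
    """Sum the item scores of one week separately for each news category.

    items: iterable of (category, position, columns, emphasis, opinion).
    """
    totals = defaultdict(int)
    for category, position, columns, emphasis, opinion in items:
        totals[category] += item_score(position, columns, emphasis, opinion)
    return dict(totals)


# Hypothetical items for one week (weekdays only).
week = [
    ("domestic politics", "top", 2, 4, -1),
    ("domestic politics", "bottom", 0, 1, +1),
    ("domestic economics", "center", 1, 2, +1),
]
indices = weekly_indices(week)
print(indices)
```

With this scheme a single item contributes between -10 and +10 to its weekly category index, so a week dominated by one negative front-page story yields a strongly negative value.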

After analyzing the performance of the four political-economic indices, only 
three of them (dropping the "international economy" index) seemed to be 
relevant in estimating the MIB index. Thus, we considered the domestic politics 
(DP) and economic (DE) indices and the international politics (IP) index 
referred to the events which occurred between time t and time t-1, together with 
the following lagged (t-1) variables: BTP futures, ITL/US$ rate, MIB index and 
the average of the main stock markets (ASM). 

The analysis was performed on a data set spanning 61 weeks, using the data of 
the first 55 weeks to define the model and the successive 6 weeks to test the 
forecasts. 



3. Econometric analysis, parametric and non-parametric 
regression models 

The univariate analysis of financial time series presents particular 
difficulties, due to the latter's specific features: very high frequency of 
observed times (weekly, daily, intraday); a wide observation period; high and 
heteroskedastic variability; non-significant autocorrelation coefficients. Some 
empirical analyses of financial variables identified a random walk process, in 
accordance with the efficient market hypothesis, so the past history of the 
variables could not be used to predict future events. In this case the last 
observation available is the best unbiased estimator for future values. 

The econometric analysis of the logarithm of MIB (LMIB) on our data confirms 
this result, highlighting a random walk process. In fact, the autocorrelation 
and partial autocorrelation functions of LMIB identify an AR(1) and taking the 
first differences of LMIB leads to a stationary model (the diagnostic tests give 
Weighted Symmetric = -2.1 and Dickey-Fuller = -3.36) with homoskedastic 
residuals (Arch test = 0.49; LR heteroskedasticity test = 0.18; WHITE 
test = 0.68). Furthermore, all cointegration tests between LMIB and the other 
variables are not significant, although we have to warn that the sample size was 
insufficient to conduct a correct test. 

In the multiple linear regression model and in the other regressive models we 
considered the MIB to be the dependent variable and the other 7 variables to be 
explicative variables. We applied the following semi-parametric or 
non-parametric models: Projection Pursuit Regression (Friedman and Stuetzle, 
1981), Local Regression model (Cleveland & Devlin 1988), Generalized 
Additive model (Hastie & Tibshirani 1990), Regression Tree (Breiman et al. 
1984). 

Most of these methods can be considered to be a special case of the general 
expression: 

$$y = \sum_{m=1}^{M} \beta_m \, h_m(X) + \varepsilon \qquad (1)$$

where $X$ is a vector of $K$ explicative variables and the $h_m$ are basis functions 
with coefficients $\beta_m$. From this point of view, the methods differ with respect 
to the definition of the set of basis functions, allowing a unified view of the 
different approaches (Borra & Di Ciaccio 1998). 
To apply these models we had to fix the smoothing parameters, to avoid 
overfitting the data. To tune the parameters the 55 weeks used were repeatedly 
and randomly split into 50 weeks (for the estimation phase) and 5 weeks (for 
the tuning). 

Surprisingly, we found that the multivariate linear regression model has a good 
fit (R^2 = 0.94, Durbin Watson = 1.90). This performance was also obtained by 
means of the contribution of the political-economic indices, which were useful 
in reducing heteroskedasticity and in explaining the sudden variations in the 
MIB. In the last section we report a table with the performance of the models 
with respect to the forecast. 
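
The repeated random split used for tuning the smoothing parameters (50 estimation weeks and 5 tuning weeks drawn from the first 55) can be sketched as follows. The number of repetitions and the random seed are not stated in the text and are assumptions of this example, as is the function name.

```python
import random


def tuning_splits(n_weeks=55, n_tune=5, n_repeats=10, seed=0):
    """Yield (estimation_weeks, tuning_weeks) index lists for each repetition."""
    rng = random.Random(seed)          # seeded for reproducibility (assumption)
    weeks = list(range(n_weeks))
    for _ in range(n_repeats):
        tune = sorted(rng.sample(weeks, n_tune))   # weeks held out for tuning
        held_out = set(tune)
        fit = [w for w in weeks if w not in held_out]
        yield fit, tune


splits = list(tuning_splits())
for fit, tune in splits[:3]:
    print(len(fit), "estimation weeks; tuning weeks:", tune)
```

In each repetition a candidate smoothing parameter would be fitted on the estimation weeks and scored on the tuning weeks, and the scores averaged over repetitions before choosing a value.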

Projection Pursuit Regression appears to be the most flexible model. This 
method uses the following approximation: 

$$y = \sum_{m=1}^{M} f_m\left( \sum_{i=1}^{k} \alpha_{im} x_i \right) + \varepsilon \qquad (2)$$

that is, the dependent variable $y$ is fitted by a sum of $M$ additive functions of 
linear combinations of the explicative variables. 

The functions $f_m$ are required to be smooth but are otherwise arbitrary. In this 
way any smooth function of $k$ variables $(x_1, x_2, \ldots, x_k)$ can be well represented by 
a large $M$. On the other hand, a bad forecasting performance is usually obtained 
for a large $M$ due to the overfitting of training data. Other parameters affect the 
behavior of the model, for example the smoothness level of the functions $f_m$, 
and sometimes it is not simple to find the right model. In general, with 
non-parametric models, it is advisable to have a large data-set to allow a reliable 
tuning of the smoothing parameters. 

In our application, using the validation set, we selected M=2 which gave a good 
forecast, indeed a decisively better one than the other regressive models. 



4. Identification of the neural network model 

The advantages and the limits of Artificial Neural Networks (ANN) in 
comparison with established statistical methods are well known (Cheng, 
Titterington, 1994; Ripley, 1994; Borra & Di Ciaccio 1998). ANN are suitable for 
the description of non-linear problems and they have been applied with success 
in numerous forecasting applications. 

ANN can be seen as a methodology to find a universal approximator with the 
following expression: 

$$y = \phi\left( \sum_{m=1}^{M} \beta_m \, g_m\left( \sum_{i=1}^{K} \alpha_{im} x_i \right) \right) \qquad (3)$$

This expression represents a neural network with $K$ input nodes, $M$ hidden 
nodes and one output node. In formula (3), $x_i$ is the $i$-th input variable, $y$ is the 
dependent variable (output node), $\alpha_{im}$ denotes the weight of the arc linking input 
variable $i$ to hidden node $m$, and $\beta_m$ denotes the weight of the arc linking hidden 
node $m$ to output node $y$. The $g_m$'s are fixed functions (generally chosen to be 
monotone, in particular sigmoidal functions) and $\phi$ is a linear or sigmoidal or 
threshold function. 
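
A forward pass through a network of this kind, with K inputs, M hidden nodes and a single output, can be sketched in a few lines. The sketch assumes sigmoidal hidden functions, no bias terms and a linear output function by default; the function names and the toy weights are illustrative assumptions rather than the configuration used in the paper.

```python
import math


def sigmoid(u):
    """Logistic function with range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def network_output(x, alpha, beta, phi=lambda u: u):
    """Output of a single-hidden-layer network.

    x:     list of K inputs
    alpha: K x M weights, alpha[i][m] links input i to hidden node m
    beta:  list of M weights linking hidden nodes to the output
    phi:   output function (identity by default)
    """
    n_hidden = len(beta)
    hidden = [sigmoid(sum(alpha[i][m] * x[i] for i in range(len(x))))
              for m in range(n_hidden)]
    return phi(sum(beta[m] * hidden[m] for m in range(n_hidden)))


# Toy example: zero input weights make every hidden node output 0.5.
x = [1.0, 2.0]
alpha = [[0.0, 0.0], [0.0, 0.0]]
beta = [1.0, 3.0]
y_linear = network_output(x, alpha, beta)
y_sigmoid = network_output(x, alpha, beta, phi=sigmoid)
print(y_linear, y_sigmoid)
```

Training then amounts to choosing the two sets of weights so that the output tracks the observed series, which is where the overfitting concerns discussed in this section arise.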

Identifying the appropriate ANN is a critical problem, as it is necessary to 
consider a wide number of choices: the selection of the input variables; the 
selection of the network architecture; the values to be assigned to the 
learning parameters. 

One of the problems that occurs with back-propagation and associated networks 
is the problem of over-fitting. In this case, the network performs well on the 
training data, but poorly on independent test data. To deal with this problem we 
reduced the network size and split the data as described in the previous 
section, testing the network performance during the learning phase on the 5 test 
weeks. 

We used 2 types of neural networks: the multilayer feedforward neural network 
(FNN) and the recurrent neural network (RNN). In the former case, input data 
come "forward", from nodes of hidden layers to nodes of the output layer; in the 
latter case, the input layer's activity patterns pass through the network more 
than once before generating a new output pattern. 

We used a multilayer feedforward neural network with 7 nodes in input, 3 
nodes for the first hidden layer and 2 nodes for the second hidden layer. Several 
alternatives regarding the parameters have been tested. The best results were 
provided by standard backpropagation with the sigmoid activation function with 
range [0, 1] (the learning rate was initially set at 0.3 and the momentum term 
fixed at 0.7). 



Figure 1: Recurrent Neural Network 7-4-2-1 




Many empirical works have demonstrated that when the time series is 
non-linear, the prediction performance of RNN is better than that of FNN. 
Indeed, by introducing time-lagged model components, the neural networks may 
respond to the same input in a different way at different times, depending on 
the sequence of inputs. 

Our final neural architecture is composed of two hidden layers, the first layer 
with 4 nodes and the second layer with 2 nodes, and 1 node in a recurrent layer. 
The output layer is fed back into the first hidden layer by means of the node of 
the recurrent layer (in this manner the nodes of the first layer can see their own 
previous output, so that their subsequent behavior can be characterized by 
previous responses). The activation function, learning rate and momentum are 
the same as in the previous feedforward neural network. 



Figure 2: Comparison of fit of two models 




5. Comparison of the forecasting performance 

To compare the performance of the models introduced in the previous 
paragraph, it is possible to adopt several forecast evaluation criteria. 

We considered the following indices: 

1) the linear correlation between the observed variation rate of MIB and the 
variation rate on the predicted MIB, denoted by rv; 

2) the index defined as: 

$$PR^2 = 1 - \frac{\sum_{t \in V} (y_t - \hat{y}_t)^2}{\sum_{t \in V} (y_t - y_{t-1})^2} \qquad (4)$$

where $V$ is the validation set, $y_t$ is the observed value at time $t$, $\hat{y}_t$ is the value 
predicted by the model at time $t$. Substantially, the $PR^2$ index compares the 
total forecasting error of the model with the total error obtained assuming a 
random walk process; 

3) the mean square error, MSE, which, being a quadratic loss function, is 
especially useful in the presence of large errors; 

4) the adjusted mean absolute percentage error index, AMAPE, which corrects the 
problem of asymmetry between the observed and forecast values: 

$$AMAPE = \frac{1}{n(V)} \sum_{t \in V} \frac{|y_t - \hat{y}_t|}{y_t + \hat{y}_t} \qquad (5)$$

where $n(V)$ is the total number of observations of $V$; 

5) the standard index MAPE, which is obtained from AMAPE by putting a 
denominator equal to $y_t$; 

6) the percentage of correct sign predictions, CSP, which measures the 
potential profitability of the forecasting model in a market trading strategy: 

$$CSP = \frac{1}{n(V)} \sum_{t \in V} z_t \qquad (6)$$

where $z_t$ is 1 when $(y_t - y_{t-1})(\hat{y}_t - y_{t-1}) > 0$ and zero otherwise. 
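
The forecast evaluation indices can be computed together in a few lines. The sketch below rests on assumptions about formulas that are only partly legible in the source: the PR2 index is taken as one minus the ratio of the model's squared error to the random-walk squared error, AMAPE uses the sum of observed and predicted values as denominator, and the percentage indices are multiplied by 100. The function name and the toy data are hypothetical.

```python
def forecast_metrics(y_prev, y_obs, y_hat):
    """Forecast indices over a validation set.

    y_prev: observed values at time t-1
    y_obs:  observed values at time t
    y_hat:  values predicted by the model for time t
    """
    n = len(y_obs)
    sq_err = [(o - f) ** 2 for o, f in zip(y_obs, y_hat)]
    rw_err = [(o - p) ** 2 for o, p in zip(y_obs, y_prev)]  # random walk error
    # A sign prediction is correct when observed and predicted changes agree.
    hits = sum(1 for p, o, f in zip(y_prev, y_obs, y_hat)
               if (o - p) * (f - p) > 0)
    return {
        "PR2": 1.0 - sum(sq_err) / sum(rw_err),
        "MSE": sum(sq_err) / n,
        "AMAPE%": 100.0 / n * sum(abs(o - f) / (o + f)
                                  for o, f in zip(y_obs, y_hat)),
        "MAPE%": 100.0 / n * sum(abs(o - f) / o
                                 for o, f in zip(y_obs, y_hat)),
        "CSP": 100.0 * hits / n,
    }


# Toy validation set of three weeks.
metrics = forecast_metrics([100.0, 102.0, 101.0],
                           [102.0, 101.0, 104.0],
                           [101.0, 100.0, 102.0])
print(metrics)
```

Under these assumptions a model that does no better than repeating the last observation has a PR2 of zero, and AMAPE is roughly half of MAPE when forecasts are close to the observed values.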

Table 1 shows the values of the performance indices obtained by each method. We 
note that the regression tree and local regression are quite incapable of 
forecasting the validation sample. Linear regression and GAM reach about the 
same results although the latter model is a generalization of the linear 
approach. This may be due to some overfitting by GAM. Neural networks and 
Projection Pursuit give the best performances, with a slight preference for RNN. 



Table 1: Comparison of the performance indices 

Model                PR^2     r_v      MSE      AMAPE%   MAPE%    CSP
Regression           0.675    0.635    329.0    0.769    1.552    66.67
RNN 7-4-2-1          0.875    0.871    126.9    0.475    0.956    83.33
FNN 7-3-2-1          0.839    0.749    163.2    0.538    1.080    83.33
Projection Pursuit   0.843    0.775    158.6    0.563    1.128    66.67
Local Regression     0.492    0.377    513.6    0.870    1.751    50.00
GAM                  0.635    0.542    369.6    0.845    1.703    66.67
Regression Tree      0.147    0.361    863.2    1.086    2.203    16.67

The results obtained offer interesting suggestions for a comparison of the 
different methods and underline the importance of the use of exogenous 
qualitative information to improve the forecast of MIB. Of course, these results 
cannot be easily generalized to other financial series or to other periods of 
time. In conclusion, we believe that this research can constitute a good 
starting point for the analysis of more extensive data-sets. 

Abstract: The problem of the best stock location assignment in a warehouse 
plays a fundamental role in optimising picking activities. In the present paper, 
this problem is addressed by considering seven variables to compute 
similarity between items. In this context, the problem of choosing the most 
adequate similarity (or dissimilarity) measure between units when applying 
Multidimensional Scaling (MDS) has been examined. Besides the right metric, 
the possibility of applying a Seriation algorithm has also been considered. By 
using both MDS and seriation, not just a single target can be considered but a 
large number of variables can be managed; by contrast, with the techniques 
used in the literature, typical of Operational Research, just a single variable 
is under observation, and therefore just a single goal can be achieved. A wide 
discussion of the results is presented. 

Key words: multidimensional scaling, seriation, similarity measures, stock 
location assignment. 



1. Introduction 

Assigning stock to the best locations in a warehouse plays a fundamental role
in optimising picking activities. Among the different components of time that
make up this activity (Mineo, Plaia, 1997 a), the time needed to reach the
first picking position from I/O (and to return) and to reach all the picking
positions in the picking list is the most relevant component: an improvement
in stock location assignment can significantly reduce this component of time.

In the quoted paper, the use of Multidimensional Scaling (MDS) (Schiffman,
Reynolds, Young, 1981) was proposed in order to optimise stock location
assignment. By using MDS, not just a single target can be considered: we are
able to handle many variables, whereas the techniques used in the literature,
which belong to Operational Research (Mineo, Plaia, 1997 b), observe just a
single variable (i.e. the picking rate) and can therefore achieve just a
single goal.

* Authors’ names are listed in alphabetical order.

The reported results show the usefulness of such a methodology, even though it
was applied to the data of a warehouse (of a Sicilian network of super- and
hypermarkets) where a sub-optimal solution had already been reached thanks to
human experience and know-how. The aim of that paper was to show the adequacy
of MDS for solving such a problem. From this point of view, the solution
reached by Mineo & Plaia (1997 a) can be considered promising. Nevertheless,
we think that some of the simplifying choices adopted while applying MDS,
justified by the aim of that paper, require closer examination, and this is
the object of the present paper. Moreover, we compare the MDS location
assignment with the one obtained by applying another multivariate technique,
seriation.



2. The application 

To examine more closely the problem mentioned in the previous section, data
from a warehouse of a Sicilian chain of super- and hypermarkets have been
considered. About 2000 different goods are stocked in the warehouse; these can
be clustered by affinity into about 25 classes, as suggested by the management
of the Private Label to which the Sicilian chain belongs.

In practice, a lack of homogeneity has been perceived inside some classes.
Therefore, two different solutions are considered in the present paper in
order to obtain more homogeneous classes.

On the one hand, we have considered just the number of item*-per-good picked
in the period from 1.7.1996 to 31.12.1996, obtaining 44 smaller classes by
separating out those goods whose behaviour, according to the considered
variable, was quite different from that of the other goods in the class.

On the other hand, a hierarchical agglomerative clustering algorithm (single
linkage algorithm, Hartigan, 1975) has been applied inside each class to the 7
variables listed below (the same variables, recorded in the same period
mentioned above, are used when applying MDS and seriation). The resulting
dendrograms have been used to obtain a new set of 44 classes, generally
different from the former.

Besides the number of item-per-good picked from the warehouse in the
above-mentioned period (variable 1), the following 6 variables have been
considered:

2. average volume of the item in the class;

3. item fragility;

4.-6. average number of items per good per order, for each of the three
different kinds of supermarket in the chain;

7. number of goods in the class.

* Item: smallest pickable quantity of a good.

Figures 1 and 2 show two different situations obtained by applying the two
approaches. On the left side of figure 1 we find a good whose behaviour is
really different from that of the other goods in the class (according to the
number of picked items). On the right side of the figure, by contrast, no good
can be isolated; therefore we get two classes from the histogram on the left,
while we leave all the goods of the right histogram in a single class.
Similarly, we maintain a single class for the goods on the left of figure 2,
while we get 4 classes from the dendrogram on the right.

A 45th fictitious class (Mineo, Plaia, 1997 b) has later been added to the 44
in order to take into account the need to allocate the classes with the
highest values of the seven variables next to the I/O position in the
warehouse, as these are composed of the most handled items: the 45th class
therefore represents the I/O position.

In the following sections a similarity matrix among classes is considered in 
order to compare the solutions gained by applying both MDS and seriation. 



Figure 1: Two different situations by applying the 1st approach



Figure 2: Two different situations by applying the 2nd approach

3. New choices in the application of the MDS

In order to select the proper measure of similarity, the use of Gower’s index
of similarity (Gower, 1971) has been considered. The problem arises because
one of our variables, «fragility», is not quantitative; nevertheless we need
to maintain its rank order (according to the meaning of the variable). The
index proposed by Gower is consequently not suitable in our case, as it would
reduce our variable to a categorical one. We therefore prefer to treat it as a
quantitative one, and a non-metric MDS algorithm is applied to the two
Euclidean dissimilarity matrices between the 45 classes obtained by the two
approaches described in section 2.

The use of the 45th fictitious class, introduced by the authors to represent
the I/O place of the warehouse, could also be criticised because of the way
this class has been constructed, i.e. by assigning to each variable the
highest value observed on the 44 classes. It could be thought that this class
distorts the final configuration, leading to an allocation of the goods that
is strongly conditioned by this class alone, because MDS is not robust in the
presence of outliers (Spence, Lewandowsky, 1989). In fact, if we compare the
final configurations with and without the fictitious class, just a rotation of
the axes is observable with the first approach (based on variable 1, fig. 3),
while a distortion in the final configuration is present with the second
approach (based on the clustering algorithm, fig. 4). Therefore only the first
approach has been considered to get the solution.

Moreover, referring to the diagram on the right in fig. 3, dimension 1 seems to 
be the most important to position classes along the aisles of the warehouse, 
considering the meaning of class 45 and the way it has been built (Mineo, Plaia, 
1997 b). 

Figure 3: MDS solutions with 44 (left) and 45 (right) classes by the 1st approach

Figure 4: MDS solutions with 44 (left) and 45 (right, the class 45 has
co-ordinates D1=6.633, D2=-0.000019) classes by the 2nd approach




4. Seriation 

In our opinion, further attempts have to be made to validate the choice of a
non-metric MDS algorithm. We have used the one implemented in the STATISTICA®
package (StatSoft, Inc. 1995), which minimises, by means of the steepest
descent method, the so-called raw-stress function, defined as

$$\text{raw-stress} = \sum_{i<j} \left[ d_{ij} - f(\delta_{ij}) \right]^2 \qquad (1)$$

where $d_{ij}$ are the reproduced distances, given the number of dimensions,
and $f(\delta_{ij})$ represents the monotone transformation of the
dissimilarities $\delta_{ij}$ computed on the input data.
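As an illustration of the raw-stress criterion, the sketch below evaluates the sum over pairs of squared differences between reproduced distances and transformed dissimilarities for a given configuration. It is not the STATISTICA implementation: the function name `raw_stress` and the toy data are hypothetical, and the monotone transformation f is assumed to have been evaluated already (here it is simply the identity), whereas a non-metric algorithm would re-estimate it by monotone regression at each iteration.

```python
import math


def raw_stress(coords, transformed_dissim):
    """Sum over pairs i < j of (d_ij - f(delta_ij)) ** 2."""
    n = len(coords)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = math.dist(coords[i], coords[j])  # reproduced distance
            total += (d_ij - transformed_dissim[i][j]) ** 2
    return total


# Hypothetical example: 3 objects placed in 2 dimensions.
config = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
f_delta = [
    [0.0, 3.0, 4.0],
    [3.0, 0.0, 6.0],
    [4.0, 6.0, 0.0],
]
print(raw_stress(config, f_delta))
```

Only the pair (2, 3) is reproduced imperfectly here (distance 5 against a target of 6), so it alone contributes to the stress.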

In fact, if we think of the way aisles are travelled in the warehouse, i.e. of
the ‘traversal’ travel policy (Caron, Marchet, 1994), which provides for the
complete crossing of each aisle where at least one item has to be picked
(fig. 5), our problem can be presented as placing objects along a continuum.
For this reason a seriation algorithm (Wright, 1985) is applied.

A loss function of the form

$$LF(x) = \sum_{i<j} \left( \delta_{ij} - |x_i - x_j| \right)^2 \qquad (2)$$

has to be minimised over $x = (x_1, x_2, \ldots, x_n)$, where $x_i$ is the
coordinate of object $i$ on the axis (think of a single aisle made up of the
20 real aisles of the warehouse) and $\delta_{ij}$ has the usual meaning.
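The idea of placing objects along a single axis can be sketched as follows, assuming the loss is the sum over pairs of squared differences between each dissimilarity and the distance between the two coordinates. The search used here, an exhaustive scan of all orderings with objects placed at equally spaced positions, is a stand-in chosen for brevity and is not the algorithm of Wright (1985); the names `seriation_loss` and `best_order` and the dissimilarity matrix are hypothetical.

```python
from itertools import permutations


def seriation_loss(x, delta):
    """LF(x) = sum over i < j of (delta_ij - |x_i - x_j|) ** 2."""
    n = len(x)
    return sum(
        (delta[i][j] - abs(x[i] - x[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


def best_order(delta):
    """Try every ordering, placing objects at positions 0, 1, 2, ...

    Feasible only for a handful of objects; a practical routine would
    optimise continuous coordinates numerically instead.
    """
    n = len(delta)
    best = None
    for perm in permutations(range(n)):
        x = [0.0] * n
        for position, obj in enumerate(perm):
            x[obj] = float(position)
        loss = seriation_loss(x, delta)
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best


# Hypothetical dissimilarities for 4 classes that lie on a line
# in the order 2, 0, 3, 1 (or, equivalently, its mirror image).
delta = [
    [0, 2, 1, 1],
    [2, 0, 3, 1],
    [1, 3, 0, 2],
    [1, 1, 2, 0],
]
loss, order = best_order(delta)
print(loss, order)
```

A solution and its mirror image always have the same loss, so the orientation of the axis has to be fixed by other means, for instance by the position of the I/O point.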
Figure 5: Traversal travel policy




Both the approaches to get the two sets of 44 classes presented in section 2 have 
been considered in order to apply the algorithm. 



5. Results 

In order to compare all the solutions, a C language simulator has been 
developed, which computes the distances travelled in order to satisfy a sample 
of 10 orders drawn for each of the 39 stores of the supermarket chain. 

In fig. 6 the following solutions are compared: 

1. the current stock location assignment; 

2. the non-metric MDS solution; 

3. seriation solution applied to the 44 classes obtained by the first approach;

4. seriation solution applied to the 44 classes obtained by the second approach.

It is immediately evident that the 4th solution is not competitive at all,
being worse than the other seriation solution: we can therefore conclude that
the clustering algorithm is not suitable for obtaining our final set of
classes.

Both the 2nd and the 3rd configurations seem to be better than the current one
and approximately equivalent to one another. It is important to highlight that
the current assignment is not an initial one for the warehouse, being actually
the result of years of experience and improvement.

In conclusion, both non-metric MDS and seriation seem to be suitable for
improving the stock location assignment of a warehouse, being also better than
the GOI assignment (Mineo, Plaia, 1997 b), which turns out to be the best
solution found in the literature.

The comparison between non-metric MDS and seriation will be repeated as soon
as the data of a whole year are available, in order to examine more closely
possible differences in their solutions, by introducing other variables such
as a seasonal component, which could influence stock location assignment.

Figure 6: Distances travelled by the operator to satisfy a sample of orders for
each of the 39 stores of the chain with the four techniques

Abstract: Nutritional studies are aimed at evaluating the implications of food
behaviour in order to detect possible health problems. In addition, their
results can be used to plan educational campaigns, regulatory interventions,
and so on. In this context, food classification can vary according to
different criteria. Therefore, food coding systems must be flexible enough to
satisfy the various requirements. This approach has been utilised in the
INN-CA study carried out by the Istituto Nazionale della Nutrizione (INN) in
1995, whose characteristics and first results are discussed in the present
paper.

Key words: food coding, nutritional studies.



1. Introduction 

Food behaviour is a complex topic because of both the number of variables
involved in its definition and the number of factors by which it is
influenced. Therefore, this phenomenon can be studied for different purposes:
economic, socio-cultural, health-related and others.

In the nutritional approach, food behaviour patterns are analysed in order to
estimate the intake of “food components”, that is, nutrient and non-nutrient
substances conveyed by foods, and the explanatory factors. This leads to a
variety of methodologies for approaching the study of nutritional patterns. In
particular, nutritional surveys can be classified by “completeness” (number of
explanatory variables included) and “precision” (in measuring the intake of
foods and food components) (Bingham, 1987, 1991; Cialfa et al., 1991; Fidanza,
1984; Marr, 1971; Pekkarinen, 1970; Saba et al., 1990, 1992; Turrini, 1991;
Turrini et al., 1993, 1995; Willet, 1990). Generally, studies aimed at
outlining food behaviour patterns for the whole population provide the basic
information needed to plan further studies that examine specific aspects in
depth (population groups with specific problems, intake of single nutrients or
non-nutrients, etc.), to design educational campaigns, to define food policy
interventions, and so on (Turrini, 1993).

In this context, food data can be collected by utilising either “free-writing”
forms, such as inventories and diaries (open-ended section), or fixed lists of
food items (closed-ended section). The first technique provides a detailed
picture of the food products consumed, but it poses some problems in data
processing.



2. Food coding 

Rationale 

The first conceptual problem in food coding is the definition of the
properties to be considered in data processing. In fact, food sets can be
partitioned into many subsets according to different points of view (Figure
1), each of which is variously related to aspects of food behaviour and its
consequences on the nutritional and health status of individuals (Saba et al.,
1990, 1992).

Figure 1: Factors influencing food classification in nutritional surveys

As a consequence, two levels of requirements are identified: identification
for nutritional evaluation (attribution of composition data), and some level
of description, both for aggregating food items to match categories defined by
different criteria and for picking out certain sources of undesirable food
components (additives, residues, contaminants, etc.).

Furthermore, aggregation is indispensable for reaching statistical
significance in consumed quantities. In fact, it is practically impossible to
reach a sufficient sample size for each single food product.

Therefore, the set of $p$ food products $P = \{P_1, P_2, \ldots, P_p\}$ is
aggregated in order to obtain the variables $food^{(s)}$ ($s = 1, \ldots, k$),
constructed as follows:

$$food^{(s)} = P_1 + P_2 + \ldots + P_i + \ldots + P_{i_s} = \sum_{i \in I_s} P_i, \qquad s = 1, \ldots, k \qquad (1)$$

where the number of categories $k$ depends on the purpose of the analysis
(source of food components, total diet, comparison with other studies, etc.)
and $I_p = I_1 \cup I_2 \cup \ldots \cup I_s \cup \ldots \cup I_k$.

According to characteristics such as origin, packaging, convenience,
preservation method, etc., each food product $P_i$ belongs to different
subsets:

$$F_v^{(c)} = \{P_i : P_i \text{ has attribute } v \text{ of characteristic } c;\ i = 1, \ldots, p\} \qquad (2)$$

where $F_v^{(c)}$ is the subset of food products with the attribute $v$ of the
characteristic $c$ and $\bigcup_v F_v^{(c)} = P$.
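The aggregation step can be sketched as follows, assuming each food product carries a consumed quantity and each category is given as an index set of product names, with every product assigned to exactly one category. The product names, quantities and the function `aggregate` are hypothetical and only illustrate the construction of the aggregated variables.

```python
# Hypothetical consumed quantities (g/day) for p = 5 food products.
quantities = {"bread": 120.0, "pasta": 80.0, "apple": 150.0, "pear": 60.0, "beef": 90.0}

# Index sets I_1, ..., I_k: each product belongs to exactly one category.
categories = {
    "cereals": ["bread", "pasta"],
    "fruit": ["apple", "pear"],
    "meat": ["beef"],
}


def aggregate(quantities, categories):
    """Return one total per category, checking that the index sets partition the products."""
    assigned = [name for members in categories.values() for name in members]
    if sorted(assigned) != sorted(quantities):
        raise ValueError("categories must cover every product exactly once")
    return {
        s: sum(quantities[name] for name in members)
        for s, members in categories.items()
    }


food = aggregate(quantities, categories)
print(food)
```

Changing the purpose of the analysis only means supplying a different `categories` mapping; the product-level data stay untouched.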

For a fixed characteristic c, a different list of foods is obtained for each attribute v:









(3) 



Structure 

Food coding systems can be classified into two types: hierarchical codes and
mixed codes (hierarchical and crosswise). In nutritional studies, we utilise a
mixed code composed of two parts: the first is aimed at identifying the type
of product (food group, subgroup, other levels of detail) and the second is
dedicated to the description of its characteristics (packaged or not, fresh or
preserved, etc.). The first part is hierarchically organised, while the second
part is crosswise. This organisation is common to systems adopted for food
item thesauri (LanguaL) and systems designed for coding food survey data
(Eurocode 2). Besides, specific food coding systems are developed depending on
objectives, research field, national requirements, and so on (Turrini et al.,
1992).

The systems differ from each other mainly in the extent of the second part,
which is strongly related to the purpose of the study.



3. A case study

Approach

In general food coding is applicable to three different types of food data (Figure 2): 

1) food products (source: surveys); 

2) food composition (source: chemistry laboratories); 

3) regulatory statements (source: law). 



Figure 2: Typology of food data

Major problems arise in the practical application of coding systems to
surveyed food products. In fact, this operation requires a knowledge of the
food products that individuals do not always have. Furthermore, the world of
food products is continuously evolving in order to adapt the supply to
consumers’ requests. Consequently, it is very important to define a flexible
coding procedure and to create detailed documentation.

In 1995 a nation-wide food behaviour study, named INN-CA 1995, was carried 
out by the INN. The food section was open-ended and it included 1) household 
inventory, 2) purchase/wastage diary, 3) recipes form, 4) individual diaries. 

The food code was framed in 9 fields as illustrated below: 

Hierarchical part 

1. Group 

2. Subgroup 

3. Detail 1 (variety) 

4. Detail 2 (brand)

Crosswise part 

5. Origin (vegetable, animal, mineral, composed) 

6. State (raw, packaged, ready-to-eat) 

7. Treatment (fresh, type of preservation)

8. Other information 

9. Cooking method (prepared at home) 

The objective was to obtain a food products database by recording single food
names (including varieties and brand names). Because the study was conducted
in 16 different centres, a preliminary list of coded items was distributed to
avoid duplicated codes for diverse foods. When a food product was not present
in this list, all fields were assigned except detail 2 or detail 1, which were
conventionally set to 99.
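A minimal sketch of how the nine-field code could be represented, assuming integer values for the hierarchical fields and free-text labels for the crosswise ones. The class `FoodCode`, its field names, the example items and the choice of data types are all hypothetical, since the text does not give the actual code values; only the split into a hierarchical and a crosswise part and the conventional value 99 come from the description above.

```python
from dataclasses import dataclass

UNLISTED = 99  # conventional value for a detail field absent from the preliminary list


@dataclass(frozen=True)
class FoodCode:
    # Hierarchical part
    group: int
    subgroup: int
    detail1: int  # variety
    detail2: int  # brand
    # Crosswise part
    origin: str
    state: str
    treatment: str
    other: str = ""
    cooking: str = ""

    def hierarchical(self):
        """Return the four hierarchical fields as a tuple, from general to specific."""
        return (self.group, self.subgroup, self.detail1, self.detail2)

    def is_unlisted(self):
        """True when a detail field carries the conventional 99."""
        return UNLISTED in (self.detail1, self.detail2)


# Hypothetical items; the numeric codes are placeholders.
items = [
    FoodCode(1, 2, 5, UNLISTED, "vegetable", "packaged", "dried"),
    FoodCode(1, 2, 5, 3, "vegetable", "packaged", "dried"),
    FoodCode(4, 1, 2, UNLISTED, "animal", "raw", "fresh"),
]

# Crosswise grouping: collect items by treatment, regardless of food group.
by_treatment = {}
for item in items:
    by_treatment.setdefault(item.treatment, []).append(item.hierarchical())
print(by_treatment)
```

The hierarchical tuple supports drill-down by prefix (group, then subgroup, and so on), while any crosswise field can serve as an independent grouping key.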

Results 

Food items have been revised at two levels: a. alphabetical correction of
writing errors, b. food coding standardisation.

Table 1 clearly shows that the application of the standardised food coding by
itself caused a strong reduction in the number of items. Therefore, the system
is efficient with regard to the standardisation of food descriptions.

It is also evident that the food groups, which were basically defined
according to a nutritional criterion (main nutrients provided), are not
homogeneous in the number of items. In fact, they contain different typologies
of food products. For example, “Cereals and cereal products” comprises bread,
pasta, rusks, cornflakes, etc., and “Meat” includes beef, pork, poultry,
rabbit, lamb, offal, salami, etc., and the preparation may be industrial or
not. For this reason food groups are divided into subgroups that are more
homogeneous, and details 1 and 2 indicate single varieties or packaged food
products.

Table 1: Number of food items before (A) and after (B) alphabetical correction,
and after food coding standardisation (C)

      Name                            A       B       C
01    Cereals and cereal products     4938    3924    448
02    Vegetables                      3371    2524    701
03    Fruit                           1159    865     231
04    Meat                            2356    1774    691
05    Sea products                    1097    858     284
06    Eggs                            87      60      18
07    Milk and dairy products         2410    1809    305
08    Oil and fats                    853     649     56
09    Sugar and sweets                3456    2751    652
10    Beverages                       2433    1878    229
11    Miscellaneous                   1345    1102    283
12    Dishes                          2102    1809    812
      Total                           25607   20003   4710

Code standardisation caused a stronger reduction than the alphabetical
correction because it was based on objective characteristics of the food
products, while the latter was affected by the “writing style”. It provided
the minimum number of items necessary to define the categories surveyed.



4. Conclusion 

The problem of defining aggregation criteria is central to the processing of
nutritional surveys. In surveys that utilise pre-coded food sections the
problem must be solved a priori, whereas in surveys based on an open-ended
food section it must be tackled in the preliminary phase of data processing.

The proposed food coding system seems to meet the data-processing requirements
for organising a food products database starting from survey data.

Further analyses will be performed to test its ability to outline food sets
according to different criteria (e.g. grouping by preservation method, origin,
etc.).

The expected result is the identification of general food coding procedures in
relation to different classification criteria.

Abstract: UNAIDED is a software program for segmentation analysis. The
program implements several techniques and criteria for segmenting a set of
units whatever the measurement scale of the criterion variable. At present,
the techniques available are: binary and ternary segmentation, monotone vs.
free analysis, ranking of predictors, and “look-ahead” search for the best
split. The analytical criterion may be chosen from a large set of implemented
criteria.

Key words: Statistical Data Analysis, Segmentation Analysis, Automatic
Interaction Detector, Regression Trees, Classification Trees.



1. Segmentation analysis 

Let Y denote a set of criterion, or dependent, variables, and X a set of
explanatory variables measured on a set of N units. Segmentation analysis is a
statistical method for stepwise partitioning of the set of units with
reference to a univariate, bivariate, or multivariate distribution of the
dependent variables.

The segmentation procedure partitions the set of units into hierarchical
clusters by selecting in a stepwise fashion the predictor that minimises the
within-cluster heterogeneity of the criterion variable(s). While performing a
segmentation analysis, researchers can insert their prior information and
substantive hypotheses for targeted data processing.

UNAIDED - UNivariate Automatic Interaction Detector of Empirical Data - is a 
software program for the segmentation analysis with reference to a univariate 
dependent variable measured on any scale. Aims and features of the program, 
together with the statistical rationale of the choices presented to users, are 
discussed in the following. 



2. UNAIDED 

UNAIDED 1.0 is the prototype of a project which aims at bringing together
several options of binary and ternary segmentation that are currently
scattered across several software programs, such as AID (Sonquist et al.,
1973), ELISEE (Cellard et al., 1967), THAID (Morgan & Messenger, 1973), CART
(Breiman et al., 1984), CHAID (Kass, 1980), C4.5 (Quinlan, 1993).

The prototype is designed for quantitative, ordinal and nominal dependent variables. 
The options available¹ for the analysis are (Scarabello, 1997):

- binary, ternary, and best-between-binary-and-ternary segmentation to disclose
efficiently non-monotone relations between the dependent variable and an
ordinal predictor;

- “free” and “monotone” combination of the categories of candidate predictors 
on ordinal measurement scale; 

- “ranking” of predictors, i.e. the partition of predictors in classes ranked 
according to causality, to control the order of predictor processing. Before the 
segmentation process is started, the user can assign predictors to (up to 4) 
ranked groups of predictors, remote causality groups being processed first and 
late causality ones being processed last; 

- “look-ahead” evaluation of the predictive power of explanatory variables, i.e.
the analysis of pairs, triples, etc. of predictors at each step, in order to
select the best predictor according to its main and interaction (with the
coupled predictors) effects. This feature is implemented for just one step
ahead, i.e. for the evaluation of first-order interactions of predictors.

A single algorithm performs all the analytic options. In fact, the combination of 
each option with the measurement scale of the criterion variable represents a route 
inside the main segmentation engine. To evaluate the goodness-of-split, the 
algorithm follows either of the two implemented segmentation strategies 
(monotone vs. free analysis), and other criteria appropriate to the measurement 
scale of the dependent variable. 

Multiple stopping rules are available for the user’s choice: (a) minimum number of 
observations in the group which is about to be split, (b) minimum number of 
observations in groups candidate for splitting, (c) maximum number of terminal 
groups, (d) minimum (proportion of) variance/entropy explained by a split, (e) 
maximum (proportion of) within-group residual variance/entropy. 

The program is user-friendly and flexible: the user may specify his/her
preferences among the offered options. If the user does not choose among
them, a default option is applied.



3. Optimisation criteria 



Each split is qualified by a reduction in the within-group heterogeneity of the
criterion variable. Three basic functions are used in UNAIDED to evaluate the
reduction in heterogeneity of the within-group² distribution of Y:

a) The Minkowski distance, $d_\alpha$, between parameters of the parent and the
resulting groups. For a quantitative variable Y, the formula is:

$$d_\alpha = \left[ \sum_{g=1}^{G} \left| \bar{Y}_g - \bar{Y} \right|^{\alpha} w_g \right]^{1/\alpha}, \qquad \alpha \ge 1 \qquad (1)$$

where $\bar{Y}_g$ is the mean of group $g$, $\bar{Y}$ is the mean of the parent
group and $w_g$ is an appropriate weight of group $g$. Two known distances are
derived from (1): the mean absolute distance ($\alpha = 1$) and the Euclidean
distance ($\alpha = 2$). For an ordinal variable, the mean has to be replaced
by the median and, because of its optimal properties, the absolute distance
($\alpha = 1$) from the median should be considered.

¹ Other analytical devices, such as the “pruning” of less important branches
(Breiman et al., 1984) and the “premium for symmetry” of the tree (Sonquist et
al., 1973), are in progress.

² For statistical analysis, groups are considered categories of a nominal
variable.

Each distance may be normalised by its maximum value, which may be either the
initial between-unit distance or the between-unit distance of the parent
group. Two normalised indexes are: (i) Fisher’s $\eta^2$, which is a Euclidean
distance with $w_g = N_g/[N\sigma^2]$, where $\sigma^2$ is the population
variance and $N_g$ is the size of group $g$; (ii) the relative group absolute
distance from the median, which is the group absolute distance with
$w_g = N_g/[N\sigma^*]$, where $\sigma^*$ is the parental absolute deviation
from the median. Both indexes vary between 0 and 1: $d'_\alpha = 0$ if the
centre of the conditional distributions is the same for all groups,
$d'_\alpha = 1$ if the conditional distributions maximally differ.

b) The distance between the observed frequency distribution of variable Y and a
reference distribution. For a discrete variable Y with K categories, the
Minkowski distance between observed and predicted values is:

$$d''_\alpha = \sum_{g=1}^{G} \sum_{k=1}^{K} \left| p_{gk} - \hat{p}_{gk} \right|^{\alpha} w_{gk} \qquad (2)$$

where $p_{gk} = n_{gk}/n$ is the observed and $\hat{p}_{gk} = p_{g.}\,p_{.k}$ is
the predicted frequency of category $k$ of the resulting group $g$ under the
hypothesis of independence between Y and X. Two popular indexes are derived
from (2) with $\alpha = 2$: (i) Pearson’s $\chi^2$, for which
$w_{gk} = n/\hat{p}_{gk}$; (ii) Goodman-Kruskal’s $\tau_b$, for which
$w_{gk} = 1/[p_{g.}(1 - \sum_k p_{.k}^2)]$.

$\chi^2$ may be standardised with Bonferroni’s correction, to account for the
degrees of freedom of the analysis³, and by its maximum,
$\mathrm{Max}(\chi^2) = n(m-1)$, where $m = \min(G, K)$. Once standardised,
$d''_\alpha$ varies between 0 and 1: $d''_\alpha = 0$ in the case of
independence between Y and the variable whose categories are groups,
$d''_\alpha = 1$ if groups are maximally different;

c) The entropy, or uncertainty, of the distribution of the criterion variable Y
within the groups resulting after segmentation with variable X. For the
segmentation analysis with a variable X, the reduction in Shannon’s entropy is:

$$H_{y/x} = H(y) - H(y|x) = -\sum_{k} p_{.k} \ln(p_{.k}) + \sum_{g=1}^{G} \sum_{k} p_{gk} \ln(p_{k|g}) \qquad (3)$$

where $p_{k|g} = p_{gk}/p_{g.}$ is the relative frequency of category $k$
conditional on group $g$. The coefficient is normalised by the entropy, $H(x)$,
of the classification variable. Once standardised, $H_{y/x}$ varies between 0
and 1: $H_{y/x} = 0$ in the case of independence between Y and the variable
whose categories are groups, $H_{y/x} = 1$ if groups are maximally different.
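The three families of criteria can be illustrated with a short sketch. It is not the UNAIDED code: the function names are hypothetical, the quantitative criterion is computed directly as the correlation ratio (the weighted sum of squared mean differences, without taking the final root), the chi-square index is divided by its maximum only (no Bonferroni correction), and count tables are assumed to have no empty rows or columns.

```python
import math
from statistics import fmean, pvariance


def eta_squared(groups):
    """Criterion (a), alpha = 2: sum_g N_g (mean_g - mean)^2 / (N * sigma^2)."""
    parent = [y for g in groups for y in g]
    n, mean, sigma2 = len(parent), fmean(parent), pvariance(parent)
    return sum(len(g) * (fmean(g) - mean) ** 2 for g in groups) / (n * sigma2)


def _proportions(table):
    """Return n, row margins p_g. and column margins p_.k of a G x K count table."""
    n = sum(sum(row) for row in table)
    p_g = [sum(row) / n for row in table]
    p_k = [sum(row[k] for row in table) / n for k in range(len(table[0]))]
    return n, p_g, p_k


def chi2_and_tau(table):
    """Criterion (b), alpha = 2: (chi-square / its maximum, Goodman-Kruskal tau_b)."""
    n, p_g, p_k = _proportions(table)
    chi2 = tau = 0.0
    for g, row in enumerate(table):
        for k, count in enumerate(row):
            expected = p_g[g] * p_k[k]       # frequency under independence
            squared = (count / n - expected) ** 2
            chi2 += squared * n / expected   # weight n / expected frequency
            tau += squared / p_g[g]          # numerator of tau_b
    tau /= 1.0 - sum(p * p for p in p_k)
    m = min(len(table), len(table[0]))
    return chi2 / (n * (m - 1)), tau


def entropy_reduction(table):
    """Criterion (c): [H(y) - H(y|x)] / H(x), using natural logarithms."""
    n, p_g, p_k = _proportions(table)

    def h(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    h_y_given_x = sum(
        p_g[g] * h([count / sum(row) for count in row])
        for g, row in enumerate(table)
    )
    return (h(p_k) - h_y_given_x) / h(p_g)


# Hypothetical split into two groups that separates the categories perfectly.
perfect = [[10, 0], [0, 10]]
print(chi2_and_tau(perfect), entropy_reduction(perfect))
```

All three indexes return 0 for a split whose groups share the same distribution of Y and reach 1 for a split that separates the categories, or the means, as much as possible, which makes them comparable across measurement scales.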

Default criteria in UNAIDED are the Euclidean between-mean distance for
quantitative variables and entropy for nominal variables.



³ The number of degrees of freedom differs according to the type of analysis, monotone vs. free.

4. Program features and algorithm

UNAIDED has a graphic user interface in Windows style. All program functions are 
accessible through menu items and dialog windows. Required user inputs, such as 
parameters and options for the segmentation analysis, are supplied through standard 
Windows input controls. 

Data may be imported from many file formats; in particular, UNAIDED can
access all Windows formats supported by the Microsoft Jet engine (i.e. MDB,
DBF, XLS, Paradox). The program is also an ODBC (Open Database Connectivity)
client; this means it can access virtually all data sources, and in practice
all database formats whose ODBC driver is installed on the host system. For
instance, SAS data sets are directly usable, provided a SAS ODBC driver is
available.

The tree obtained by the segmentation analysis is displayed by means of a
Windows Tree control, which allows an easy examination of the node/group
statistics. The analysis output, that is the tree structure and the node/group
statistics, is saved in an output file, in MDB format, containing a record for
each node/group of the tree. This allows a saved tree output to be reviewed
without running the analysis again.

UNAIDED performs segmentation analysis by means of an iterative algorithm,
processing the segmentation tree by levels, and not by branches as recursive
algorithms do. The level of a group in the segmentation process is the number
of splits that occurred to isolate it. Beginning from the root group of size n
at level zero, the first split separates two or three subgroups, which are the
groups of the first level, and so on.

The algorithm examines and, where possible, splits each group at the current
level. To split a group, for each predictor all the possible splits, according
to the type of predictor, are considered, and for each one the split function
is computed to evaluate the gain in within-group homogeneity following that
split. The best splits of all predictors are then compared, and the very best
one among those considered is selected. The procedure is repeated until no
group of the current level can be split.

The search for the best split covers the set of splits based on a single
predictor at a time. Admissible splits are then determined according to the
type of predictor: when a predictor is denoted as “monotone”, the order of its
categories is kept fixed, that is, only adjacent categories on the ordinal
scale can be combined to identify a group; when the predictor is denoted as
“free”, any combination of its categories is admissible.

To make the search computationally practicable when the analysis is free, 
predictor categories are ordered according to a function consistent with the split 
criterion function; then the algorithm proceeds as in the case of monotone 
analysis. For instance, when the dependent variable is quantitative and the η² split 
criterion is used, predictor categories are ordered according to conditional 
averages of the criterion variable (Fisher, 1958); in this case the solution found is 
the very best one. For binary segmentation with a nominal dependent variable and 
the Entropy split criterion, predictor categories are first ordered according to the 
conditional entropy of the Y variable. In this case, however, the solution found is 
not necessarily the best. 
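
The ordering device for a free predictor can be illustrated with a small sketch. It assumes a quantitative dependent variable and a least-squares criterion (within-group sum of squares), and compares the search over categories ordered by conditional mean with an exhaustive search over all binary splits. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def within_ss(values):
    """Sum of squared deviations from the group mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def split_cost(groups, left_cats):
    """Within-group sum of squares after a binary split of the categories."""
    left = [y for c in left_cats for y in groups[c]]
    right = [y for c in groups if c not in left_cats for y in groups[c]]
    return within_ss(left) + within_ss(right)


def best_split_ordered(groups):
    """Order categories by conditional mean, then try only the c - 1 cuts."""
    order = sorted(groups, key=lambda c: mean(groups[c]))
    candidates = [tuple(order[:i]) for i in range(1, len(order))]
    return min(candidates, key=lambda left: split_cost(groups, left))


def best_split_exhaustive(groups):
    """Try every non-empty proper subset of categories as the left group."""
    cats = list(groups)
    candidates = [s for r in range(1, len(cats))
                  for s in combinations(cats, r)]
    return min(candidates, key=lambda left: split_cost(groups, left))


# Toy data (assumption): values of the criterion variable per category.
groups = {"a": [1.0, 2.0], "b": [9.0, 10.0], "c": [1.5, 2.5], "d": [8.0, 11.0]}
fast = best_split_ordered(groups)
slow = best_split_exhaustive(groups)
print(fast, split_cost(groups, fast))
print(slow, split_cost(groups, slow))
```

In this example both searches reach the same cost, but the ordered search evaluates c - 1 splits instead of the 2^(c-1) - 1 distinct binary splits of c free categories.
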

The standard segmentation, based on a single predictor at a time, is 
locally optimal at each step, but the overall solution obtained by the 
stepwise procedure is not necessarily the best partition of the data set 
according to the criterion. The algorithm of UNAIDED implements the Look-ahead option, which 
enhances the overall solution by noting interactions between the currently 
processed predictor and all other predictors at the next step. 
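
The idea of looking one step ahead can be illustrated with a sketch. The text above does not detail how the Look-ahead option is computed, so what follows is only one possible reading, under stated assumptions: each candidate predictor is scored by the within-group sum of squares remaining after the best follow-up split of each resulting subgroup. Predictors are binary, and all names and data are hypothetical.

```python
from statistics import mean


def within_ss(ys):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def partition(rows, pred):
    """Group rows by the value of one predictor."""
    parts = {}
    for row in rows:
        parts.setdefault(row[pred], []).append(row)
    return list(parts.values())


def one_step_cost(rows, pred):
    """Within-group sum of squares right after splitting on `pred`."""
    return sum(within_ss([r["y"] for r in part])
               for part in partition(rows, pred))


def look_ahead_cost(rows, pred, predictors):
    """Cost after splitting on `pred` and then applying, in each subgroup,
    the best split on one of the other predictors."""
    others = [p for p in predictors if p != pred]
    return sum(min(one_step_cost(part, p) for p in others)
               for part in partition(rows, pred))


# Toy data (assumption): y depends on the interaction of a and b,
# while c is only weakly related to y.
data = [(0, 0, 0, 0.0), (0, 0, 0, 0.0), (0, 1, 1, 1.0), (0, 1, 0, 1.0),
        (1, 0, 1, 1.0), (1, 0, 1, 1.0), (1, 1, 0, 0.0), (1, 1, 1, 0.0)]
rows = [dict(zip(("a", "b", "c", "y"), r)) for r in data]
predictors = ["a", "b", "c"]

greedy = min(predictors, key=lambda p: one_step_cost(rows, p))
ahead = min(predictors, key=lambda p: look_ahead_cost(rows, p, predictors))
print("one-step choice:", greedy)
print("look-ahead choice:", ahead)
```

In this toy case the one-step search prefers c, because neither a nor b alone reduces the within-group variation, whereas the two-step score reveals that splitting on a and then on b separates the groups completely.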

No restriction is placed on the number of observations and variables to be 
processed by the program; limitations are determined only by available memory 
and computing time. 



5. Technical section 

UNAIDED 1.0 is a program for personal computers based on Intel processors and runs 
under the Windows 95 operating system. It should also run under Windows NT 4.0, 
although further testing is needed for the latter environment. The program needs a 
setup to install some Windows shared components (DLL and OCX) that come with the Visual 
C++ development environment. To use UNAIDED, it is advisable to have a system with 
a 586 class processor and at least 16 MB of RAM. 

The program is coded in Visual C++ 4.0 and was developed under the Windows 95 
environment. The program structure was designed according to the specifications of the 
document-view architecture, the object-oriented application framework provided by the 
MFC 4.0 (Microsoft Foundation Classes) class library. The development of the 
first prototype took about 10 man-months, and the program is still under development. 



6. A sample application 

An application, which shows how the program works with a typical problem of 
classification, is presented. The well known Iris data set is used to exemplify the 
comparison between binary and ternary segmentation. 

Data are read directly from a SAS data set through the File Open menu function, by 
specifying the ODBC SAS data source (Fig. 1). From the list of on-line SAS data sets 
in the defined SAS libnames, the Iris data set is selected. The data read are browsed in a 
grid window, and a dialog window allows specifying the Iris species as the nominal 
criterion variable and the four variables, petal and sepal height and length, as 
monotone predictors (Fig. 2). 

Binary and ternary segmentation are both performed, using Entropy as the split criterion 
and keeping all the other options fixed; in particular, stopping rules were set to their 
minimum values to get the maximum growth of the tree. Figures 3 and 4 report the resulting 
trees in the output window: by clicking on a node, one gets the node statistics 
displayed in the left part of this window. The variable sepal height dominates both 
analyses, being selected repeatedly as the best split predictor. Ternary segmentation 
produces a simpler and more compact tree, much easier to read than the binary tree. 
Multiple segmentation could do even better, separating in a single split the effect of 
this predictor. The Look-ahead option does not improve the results, giving the same 
tree as ternary segmentation without Look-ahead. This outcome suggests that no 
interaction among predictors is present in the data. 

Fig. 2: The Data Browser Window and the Variable Selection Dialog Window 

Fig. 3: The Output Dialog Window: binary segmentation of Iris data set 

Fig. 4: The Output Dialog Window: ternary segmentation of Iris data set 



7. Conclusions 

The program requires further development. Not all the considered splitting criteria are 
supported. Some important functionality is lacking, such as an output printing 
function. 

The present version of the program is the first step of a project for the 
development of statistical methods based on segmentation. Some important 
developments are planned to extend the analysis capabilities of the program: 

- larger set of split functions 

- best split search extended to splits based on combinations of predictors. 

We plan to distribute the software on the FTP site ftp.stat.unipd.it as soon as the 
necessary test phase is completed. 